LogSat Software



Lee (Groupie) Joined: 04 February 2005 Location: United States Points: 50 Topic: How Does Bayesian Learn ? Posted: 13 December 2004 at 4:03pm
	I am trying to better understand how SpamFilter's Bayesian filter works and whether it is actually improving over time. My question is: where does the corpus db actually learn new words, and how does it weight them? Does it learn only from the keywords that are entered? Can it also scan and learn from all quarantined mail? It seems to me that the Bayesian filter should scan every rejected email and treat it as spam. For example, if spam is sent to an unknown address, shouldn't the filter add the words from those emails to its db? Lee

LogSat (Admin Group) Joined: 25 January 2005 Location: United States Points: 4106 Posted: 13 December 2004 at 10:11pm
	Lee, SpamFilter examines every single email that it receives (good and spam) and breaks it apart into tokens. Tokens from spam emails are weighed and assigned a certain score; tokens from clean emails are weighed differently and assigned a different score. All tokens are inserted into the corpus database. As new emails arrive, the corpus is updated in real time, and SpamFilter reloads it every 10 minutes or so. The tokens are retrieved from the email's source, not from any of the keywords or filters you have configured. The SpamFilter keywords and filters simply help the statistical filter determine what is spam and what is not; they are not themselves added to the corpus. Furthermore, when a user force-delivers a valid email that was mistakenly quarantined, SpamFilter processes that email again: its tokens are "tagged" to reflect a false positive, and the corpus database is updated to reduce the chance of a similar mistake happening in the future. Roberto F. LogSat Software
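The learning loop described in this reply can be sketched roughly as follows. This is a minimal illustration, not SpamFilter's actual code: the thread does not give the tokenizer rules, the scoring formula, or the corpus schema, so the `Corpus` class, the regular expression, and the smoothed probability below are all assumptions borrowed from generic Bayesian spam filtering.

```python
import re
from collections import Counter

# Hypothetical tokenizer: SpamFilter's real token rules are not described in the thread.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def tokenize(source):
    """Split a raw message source into a set of lowercase tokens."""
    return {t.lower() for t in TOKEN_RE.findall(source)}


class Corpus:
    """Toy in-memory corpus: per-token message counts for spam and good mail."""

    def __init__(self):
        self.spam = Counter()
        self.good = Counter()
        self.n_spam = 0
        self.n_good = 0

    def learn(self, source, is_spam):
        """Record every token of one message on the spam or the good side."""
        tokens = tokenize(source)
        if is_spam:
            self.spam.update(tokens)
            self.n_spam += 1
        else:
            self.good.update(tokens)
            self.n_good += 1

    def correct_false_positive(self, source):
        """Move a message that was learned as spam over to the good side."""
        tokens = tokenize(source)
        self.spam.subtract(tokens)
        self.spam = +self.spam  # drop tokens whose count fell to zero or below
        self.n_spam = max(0, self.n_spam - 1)
        self.good.update(tokens)
        self.n_good += 1

    def token_spam_probability(self, token):
        """Smoothed estimate (assumed formula) of how strongly a token indicates spam."""
        p_spam = (self.spam[token] + 1) / (self.n_spam + 2)
        p_good = (self.good[token] + 1) / (self.n_good + 2)
        return p_spam / (p_spam + p_good)


corpus = Corpus()
corpus.learn("Subject: Buy cheap pills now", is_spam=True)
corpus.learn("Subject: Meeting notes attached", is_spam=False)
corpus.learn("Subject: Invoice for your order", is_spam=True)      # quarantined by mistake
corpus.correct_false_positive("Subject: Invoice for your order")   # user force-delivers it
```

Here `learn` stands in for the real-time corpus update, and `correct_false_positive` stands in for what happens on a force-delivery: the message's tokens stop counting as spam evidence and start counting as good evidence.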

mikek (Senior Member) Joined: 22 February 2005 Location: Switzerland Points: 133 Posted: 16 December 2004 at 3:55am
	I am still convinced that for the Bayesian filter to be effective, there should be some way to report false negatives. Right now, if you don't keep a lengthy list of keywords, too much spam still gets through and is tagged as "good" email in the Bayesian filter, which renders the Bayesian filter useless.

keizersozay (Groupie) Joined: 26 January 2005 Location: United States Points: 77 Posted: 16 December 2004 at 2:37pm
	I've been thinking about what you said and I am going to try something. Hopefully Roberto can tell me whether this will work. What about adding an email address to SpamFilter in the 'blacklist to' file, say 'reportspam@spam.mycompanyname' or something like that, with the ':nondr' option? Notice that I am not using a real domain. Now set up a distribution list on your email server called 'reportspam' and include the above email address. (You may have to first set up the above email address as a contact, then add the contact to the distribution list.) Then set up your email server to forward email for that address to SpamFilter using a smarthost... I think that can be done... This would automatically add the contents of the email to the Bayesian filter and help it be more effective.

LogSat (Admin Group) Joined: 25 January 2005 Location: United States Points: 4106 Posted: 17 December 2004 at 12:08am
	If I understood your idea correctly, end-users would be forwarding the spam to the reporting address. If so, unfortunately I do not believe that will yield accurate results. The statistical filter works on the email's source, and if a user has Microsoft Outlook, for example, that client completely changes the email's source. When the user forwards a message, the forwarded copy is completely different from the original, so new messages similar to the original won't match the tokens learned from the forwarded copy. Unless the user is able to forward the full, unmodified source of the original email, this will not work as expected. Roberto F. LogSat Software
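To make the point about forwarding concrete, here is a small sketch that tokenizes an invented original message source and an invented forwarded copy, then lists the tokens that are lost. Both message texts, the header values, and the tokenizer are made up for illustration; real mail clients rewrite sources in their own ways.

```python
import re


def tokenize(source):
    """Split a raw message source into a set of lowercase tokens (assumed rules)."""
    return {t.lower() for t in re.findall(r"[A-Za-z0-9$'-]+", source)}


# Invented example: the spam as it originally arrived at the server.
original = (
    "Received: from mx.badhost.example\n"
    "X-Mailer: BulkSender 2.1\n"
    "Subject: Cheap meds\n"
    "\n"
    "Order cheap meds today"
)

# Invented example: the same spam after an end-user forwards it from a mail client.
forwarded = (
    "Received: from mail.mycompany.example\n"
    "X-Mailer: Microsoft Outlook\n"
    "Subject: FW: Cheap meds\n"
    "\n"
    "-----Original Message-----\n"
    "Order cheap meds today"
)

original_tokens = tokenize(original)
forwarded_tokens = tokenize(forwarded)

# Tokens the filter would see in future copies of the spam but never learn
# from the forwarded report.
lost = original_tokens - forwarded_tokens
# Tokens introduced by the forwarding client that the original never contained.
added = forwarded_tokens - original_tokens
```

In this toy case the body words survive, but the header tokens that identify the sender (such as the sending host and mailer) end up in `lost`, while tokens from the user's own mail client end up in `added` and would be learned as spam evidence.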

LogSat (Admin Group) Joined: 25 January 2005 Location: United States Points: 4106 Posted: 17 December 2004 at 12:13am
	Mike, your statement is correct: SpamFilter's statistical filter could be trained better if we could feed it false negatives. Please note, however, that a properly configured SpamFilter will block huge amounts of spam. Over the past months, for example, our own server has recorded the following email counts: [Messages] Spam=9105146 Good=559984. We receive roughly 16x more spam than good email. Still, under these conditions, during the past 3 days we blocked 100,000 emails using the MAPS filter, 90,000 using blacklisted countries, and only 1,000 using Bayesian filtering. The statistical filter is one of the last to be applied as email arrives; the other filters run first. And even though 1,000 may seem small compared to 100,000, it still means that we received one thousand fewer spam emails in three days.... Roberto F. LogSat Software
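The effect of filter ordering on those counts can be illustrated with a short sketch. The filter names follow the post, but the `run_filters` helper, the message fields, the 0.9 threshold, and the relative order of the MAPS and country checks are assumptions; the post only says that the statistical filter runs near the end.

```python
from collections import Counter

# Hypothetical lookup data for the sketch.
MAPS_LISTED = {"203.0.113.7"}
BLOCKED_COUNTRIES = {"XX"}

# Ordered chain: earlier filters get the first chance to reject a message.
FILTERS = [
    ("maps", lambda m: m["ip"] in MAPS_LISTED),
    ("country", lambda m: m["country"] in BLOCKED_COUNTRIES),
    ("bayesian", lambda m: m["spam_score"] > 0.9),  # assumed threshold
]


def run_filters(message, filters):
    """Return the name of the first filter that rejects the message, or None."""
    for name, rejects in filters:
        if rejects(message):
            return name
    return None


messages = [
    {"ip": "203.0.113.7", "country": "US", "spam_score": 0.99},   # listed IP, also high score
    {"ip": "198.51.100.4", "country": "XX", "spam_score": 0.95},  # blocked country, also high score
    {"ip": "198.51.100.9", "country": "US", "spam_score": 0.97},  # only the Bayesian score catches it
    {"ip": "198.51.100.10", "country": "US", "spam_score": 0.10}, # good mail
]

blocked_by = Counter(run_filters(m, FILTERS) for m in messages)
```

Three of the four sample messages have a high Bayesian score, yet only one is credited to the Bayesian filter, because the earlier filters reject the other two first. A low Bayesian count in the statistics therefore reflects its position in the chain as much as its accuracy.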


Spam Filter ISP Support Forum