
Bayesian Filter

Printed From: LogSat Software
Category: Spam Filter ISP
Forum Name: Spam Filter ISP Support
Forum Description: General support for Spam Filter ISP
URL: http://www.logsat.com/spamfilter/forums/forum_posts.asp?TID=5082
Printed Date: 23 October 2017 at 10:20pm


Topic: Bayesian Filter
Posted By: mikek
Subject: Bayesian Filter
Date Posted: 02 March 2005 at 6:07am

I suspect something is fishy with the Bayesian filter. I have had it running for quite some time now (Spam=26824, Good=15002, from corpus.ini), but the numbers for common tokens seem very low.

For example, I have the keyword "y0u" in my keywords list. There were 11 E-Mails rejected by that keyword in my current Quarantine DB (which keeps E-Mail for 3 days only). But the corpus entry for the same token only shows a value of 2:

*y0u,0,2,0.400000005960464,02.03.2005

The tokens with the highest counts are all strangely encoded values (probably from virus attachments), like:

*PC9saT4gPGxpIGNsYXNzPXNtYWxsPj,0,816,0.999899983406067,27.02.2005
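
One possible reading of the two corpus lines quoted in this post, sketched in Python. Everything here is an assumption that the thread does not confirm: that the fields are token, good count, spam count, probability, and date; that the leading asterisk is just a marker; and that the filter uses a scheme similar to Paul Graham's "A Plan for Spam", with a neutral value of 0.4 for rarely seen tokens and a cap of 0.9999. The function names and thresholds are hypothetical and are not taken from Spam Filter ISP.

```python
def parse_corpus_line(line):
    """Split a corpus line into fields (the field meanings are assumed)."""
    if line.startswith("*"):
        line = line[1:]  # assumed to be a marker, not part of the token
    token, good, bad, prob, date = line.rsplit(",", 4)
    return token, int(good), int(bad), float(prob), date


def token_probability(good, bad, n_good, n_bad,
                      min_count=5, unknown=0.4, floor=0.0001, ceil=0.9999):
    """Graham-style spam probability for one token (hypothetical thresholds)."""
    g = 2 * good  # Graham doubles good counts to bias against false positives
    b = bad
    if g + b < min_count:
        return unknown  # too few sightings: treat the token as neutral
    good_ratio = min(1.0, g / n_good)
    bad_ratio = min(1.0, b / n_bad)
    p = bad_ratio / (good_ratio + bad_ratio)
    return max(floor, min(ceil, p))


# Corpus totals quoted in this post: Spam=26824, Good=15002
N_GOOD, N_BAD = 15002, 26824

_, good, bad, stored, _ = parse_corpus_line("*y0u,0,2,0.400000005960464,02.03.2005")
print(token_probability(good, bad, N_GOOD, N_BAD))   # 0.4
print(token_probability(0, 816, N_GOOD, N_BAD))      # 0.9999
```

Under these assumptions, the 2 in the "y0u" line is a count of spam messages containing that exact token, and 0.4 is a neutral placeholder rather than a learned probability. That would match both stored values, but it remains a guess about the file format.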

Is this a bug or is there a rational explanation?

All e-mails pass the Bayesian filter with 0% spam probability. I have never seen any other number in the activity log.

And I would also like to see a function to mark spam that has passed through the filter to further fine-tune the filter.

Version 2.1.2.410

 




Replies:
Posted By: Cire
Date Posted: 02 March 2005 at 3:44pm

I am seeing similar results.

 



Posted By: LogSat
Date Posted: 02 March 2005 at 5:09pm
The Bayesian filter is case-sensitive, so y0u, Y0u, y0U, and Y0U are all statistically distinct tokens, while the keyword filter is not case-sensitive. The 11 hits in the keyword filter may therefore be spread across those 4 case variants. Please check the corpus file to see whether the other 3 variations are present.
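
A minimal Python sketch of the difference described above, using made-up messages and a simple whitespace tokenizer (both are assumptions for illustration; the real tokenizer of Spam Filter ISP is not described in this thread):

```python
from collections import Counter

# Hypothetical message bodies, for illustration only
messages = [
    "Y0U have won a prize",
    "y0u are a winner",
    "call now, Y0u lucky winner",
    "hello, how are you",
]

# Case-sensitive token counts, as a Bayesian corpus would store them
token_counts = Counter(tok for body in messages for tok in body.split())

# Case-insensitive keyword match, as a keyword filter would apply it
keyword = "y0u"
keyword_hits = sum(1 for body in messages if keyword in body.lower())

print(token_counts["y0u"], token_counts["Y0u"], token_counts["Y0U"])  # 1 1 1
print(keyword_hits)                                                   # 3
```

Here the keyword matches three messages, while each case variant is counted only once as a token, so a single corpus entry can show a lower number than the keyword filter's hit count.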
As for the "odd" token, spammers are known to add gibberish to emails, so it is normal to see such entries. What matters is that when that "word" is found in an email, there is a 99.989998% chance that it is spam according to the token, so it will help in categorizing the email as spam.
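
As a side check on the "odd" token quoted earlier in the thread, it can be tested for base64 content. The sketch below uses only the Python standard library. The helper name is hypothetical, and it decodes only the longest prefix whose length is a multiple of 4, because the token may have been cut at an arbitrary point.

```python
import base64


def decode_base64_prefix(token):
    """Decode the longest prefix of token that is a whole number of base64 groups."""
    usable = len(token) - (len(token) % 4)
    return base64.b64decode(token[:usable]).decode("ascii", errors="replace")


token = "PC9saT4gPGxpIGNsYXNzPXNtYWxsPj"
print(decode_base64_prefix(token))  # </li> <li class=small
```

The result is an HTML fragment, which suggests that this token may come from a base64-encoded HTML message part rather than from random filler text. Whether the filter decodes such parts before tokenizing is not stated in this thread.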

Can you please post the statistics for the emails blocked by the various filters from the "Statistics" tab so we can have an idea of the number of emails blocked?


-------------
Roberto Franceschetti

LogSat Software: http://www.logsat.com

Spam Filter ISP: http://www.logsat.com/sfi-spam-filter.asp


Posted By: mikek
Date Posted: 03 March 2005 at 2:58am

There are no other case variants of "y0u" in the corpus DB.

My statistics show that no mail has ever been rejected by the Bayesian filter so far.

Statistics from the current quarantine (mail is kept for 3 days; the filter is still running as a test with only a few domains, so the numbers aren't that high, but it has been running for several months now):

895 IP found in MAPS search
750 Exceeded maximum number of RCPT TO
340 Keywords found in content
221 Invalid sender domain MX Record
198 SPF Sender Policy Framework match

 



