Spam Filter ISP Support Forum


Bayesian Filter

mikek
Senior Member

Joined: 22 February 2005
Location: Switzerland
Status: Offline
Points: 133
Posted: 02 March 2005 at 6:07am

I suspect something is fishy with the Bayesian filter. I have had it running for quite some time now (Spam=26824, Good=15002, from corpus.ini), but the counts for common tokens seem very low.

For example, I have the keyword "y0u" in my keywords list. In my current Quarantine DB (which keeps e-mail for 3 days only), 11 e-mails were rejected by that keyword. But the corpus entry for the same token shows a value of only 2:

*y0u,0,2,0.400000005960464,02.03.2005

My Tokens with the biggest values are all strange encoded values (probably from virus attachments) like:

*PC9saT4gPGxpIGNsYXNzPXNtYWxsPj,0,816,0.999899983406067,27.02.2005
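
For anyone who wants to inspect such entries with a script, here is a minimal Python sketch. The field layout (token, good count, spam count, probability, date) is only an assumption inferred from the entries quoted in this post; the thread does not document the corpus format, and `CorpusEntry` and `parse_corpus_line` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class CorpusEntry:
    token: str
    good: int           # assumed: count seen in good mail
    spam: int           # assumed: count seen in spam
    probability: float  # assumed: per-token spam probability
    last_seen: date     # assumed: dd.mm.yyyy date of last update


def parse_corpus_line(line: str) -> CorpusEntry:
    # Split from the right so a token that itself contains commas stays intact.
    token, good, spam, prob, day = line.strip().rsplit(",", 4)
    return CorpusEntry(
        token=token.lstrip("*"),  # the leading "*" is treated as a marker (assumption)
        good=int(good),
        spam=int(spam),
        probability=float(prob),
        last_seen=datetime.strptime(day, "%d.%m.%Y").date(),
    )


entry = parse_corpus_line("*y0u,0,2,0.400000005960464,02.03.2005")
print(entry.token, entry.good, entry.spam, entry.last_seen)
```

Sorting the parsed entries by the assumed spam count would make it easy to compare the biggest tokens against the keyword hits in the quarantine.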

Is this a bug or is there a rational explanation?

All e-mails pass the Bayesian filter with 0% spam probability; I have never seen any other number there in the activity log.

And I would also like to see a function to mark spam that has passed through the filter to further fine-tune the filter.

Version 2.1.2.410

 

Cire
Newbie


Joined: 24 February 2005
Status: Offline
Points: 8
Posted: 02 March 2005 at 3:44pm

I am seeing similar results.

 

LogSat
Admin Group

Joined: 25 January 2005
Location: United States
Status: Offline
Points: 4105
Posted: 02 March 2005 at 5:09pm
The Bayesian filter is case sensitive, so y0u, Y0u, y0U and Y0U are statistically different tokens, while the keyword filter is not case sensitive. The 11 hits in the keyword filter therefore also include the other three case variants. Please check the corpus file to see whether those variants are present.
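
To illustrate the difference, here is a small Python sketch that collects every case variant of a token from a list of corpus tokens. The helper `case_variants_in_corpus` and the sample tokens are made up for illustration; they are not part of Spam Filter ISP.

```python
def case_variants_in_corpus(token, corpus_tokens):
    """Return the corpus tokens that equal `token` when case is ignored."""
    wanted = token.casefold()
    return [t for t in corpus_tokens if t.casefold() == wanted]


# Made-up sample data: a case-sensitive corpus stores each spelling separately.
corpus_tokens = ["y0u", "Y0U", "you", "y0ur"]

variants = case_variants_in_corpus("y0u", corpus_tokens)
print(variants)  # ['y0u', 'Y0U']
```

A case-insensitive keyword match would count both spellings as one hit, while a case-sensitive corpus keeps a separate count for each.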
As for the "odd" token: spammers are known to add gibberish to emails, so it is normal to see such entries. What matters is that when that "word" is found in an email, there is a 99.989998% chance that it is spam according to the token, so it helps categorize the email as spam.
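
As a rough illustration of why one high-value token matters, the Python sketch below combines per-token probabilities with the naive Bayes formula popularized by Paul Graham's "A Plan for Spam". This is an assumption for illustration only: this thread does not state which formula Spam Filter ISP uses, and `combined_spam_probability` is a hypothetical name.

```python
import math


def combined_spam_probability(token_probs):
    """Combine per-token spam probabilities, treating tokens as independent.

    Assumes every probability is strictly between 0 and 1.
    """
    spam = math.prod(token_probs)
    ham = math.prod(1.0 - p for p in token_probs)
    return spam / (spam + ham)


# Two low-value tokens alone, then the same tokens plus one near-certain token.
without_strong = combined_spam_probability([0.4, 0.4])
with_strong = combined_spam_probability([0.4, 0.4, 0.9999])
print(without_strong, with_strong)
```

Under this formula a single token near 1.0 pulls the combined score well above the result for the weak tokens alone.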

Can you please post the statistics for the emails blocked by the various filters from the "Statistics" tab so we can have an idea of the number of emails blocked?
Roberto Franceschetti

LogSat Software

Spam Filter ISP
mikek
Senior Member

Joined: 22 February 2005
Location: Switzerland
Status: Offline
Points: 133
Posted: 03 March 2005 at 2:58am

There are no other case variants of "y0u" in the corpus db...

My statistics show that no mail has ever been rejected by the Bayesian filter so far.

Statistics from the current quarantine (kept for 3 days; this is still a test with only a few domains, so the numbers aren't that high, but it has been running for several months now):

895 IP found in MAPS search
750 Exceeded maximum number of RCPT TO
340 Keywords found in content
221 Invalid sender domain MX Record
198 SPF Sender Policy Framework match

 
