Bayesian Filter
mikek
Senior Member
Joined: 22 February 2005 Location: Switzerland Status: Offline Points: 133 |
Topic: Bayesian Filter. Posted: 02 March 2005 at 6:07am
|
I suspect something fishy with the Bayesian filter. I have had it running for quite some time now (Spam=26824, Good=15002, from corpus.ini), but the numbers for common tokens seem very low. For example, I have the keyword "y0u" in my keywords list. 11 e-mails were rejected by that keyword in my current quarantine DB (which keeps e-mail for 3 days only), but the corpus entry for the same token only shows a value of 2:

*y0u,0,2,0.400000005960464,02.03.2005

My tokens with the biggest values are all strange encoded values (probably from virus attachments), like:

*PC9saT4gPGxpIGNsYXNzPXNtYWxsPj,0,816,0.999899983406067,27.02.2005

Is this a bug, or is there a rational explanation? All e-mails always pass the Bayesian filter with 0% spam probability; I have never seen another number in the activity log.

I would also like to see a function to mark spam that has passed through the filter, to further fine-tune the filter.

Version 2.1.2.410
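For anyone comparing corpus entries by hand, here is a minimal Python sketch that splits lines like the two quoted above into fields. The field names are assumptions, not taken from the product documentation: the thread only suggests that the third field is the "value" (count) and the fourth is the token's spam probability. The meaning of the second field and of the leading "*" is unknown, so the sketch just carries them along.

```python
from dataclasses import dataclass


@dataclass
class CorpusEntry:
    token: str
    unknown: int        # second field; its meaning is not stated in the thread
    count: int          # assumed to be the "value" discussed in the post
    probability: float  # assumed to be the token's spam probability (0..1)
    date: str           # dd.mm.yyyy as it appears in the corpus line


def parse_corpus_line(line: str) -> CorpusEntry:
    # Split from the right so a token that itself contains commas stays intact.
    token, unknown, count, prob, date = line.strip().rsplit(",", 4)
    # Assumption: a leading "*" is a marker and not part of the token.
    if token.startswith("*"):
        token = token[1:]
    return CorpusEntry(token, int(unknown), int(count), float(prob), date)


entry = parse_corpus_line("*y0u,0,2,0.400000005960464,02.03.2005")
print(entry.token, entry.count, f"{entry.probability:.1%}")
```

Splitting from the right is a deliberate choice: the last four fields look fixed, while a token could in principle contain a comma.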
Cire
Newbie
Joined: 24 February 2005 Status: Offline Points: 8 |
Posted: 02 March 2005 at 3:44pm |
|
I am seeing similar results.
LogSat
Admin Group
Joined: 25 January 2005 Location: United States Status: Offline Points: 4106 |
Posted: 02 March 2005 at 5:09pm |
|
The Bayesian filter is case sensitive, so y0u, Y0u, y0U and Y0U are all statistically different tokens, while the keyword filter is not case sensitive. The 11 hits in the keyword filter therefore cover all four case variations. Please check the corpus file to see whether the other variations are present.

As for the "odd" token, spammers are known to add gibberish to emails, so it is normal to see those entries. What matters is that when that "word" is found in an email, there is a 99.989998% chance it is spam according to the token, so it helps categorize the email as spam.

Can you please post the statistics for the emails blocked by the various filters from the "Statistics" tab, so we can get an idea of the number of emails blocked?
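The corpus check suggested above can be done with a short script rather than by eye. This is only a sketch under assumptions: it supposes the corpus is a text file with one comma-separated entry per line in the form quoted earlier in the thread, that the third field is the count, and that a leading "*" is a marker rather than part of the token. The function name `case_variants` and every sample line except the first are made up for illustration.

```python
def case_variants(lines, keyword):
    """Return {token: count} for corpus tokens that match keyword ignoring case."""
    found = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # Split from the right so commas inside a token do not break parsing.
        token, _unknown, count, _prob, _date = line.strip().rsplit(",", 4)
        if token.startswith("*"):
            token = token[1:]  # assumption: "*" is a marker, not token text
        if token.lower() == keyword.lower():
            found[token] = found.get(token, 0) + int(count)
    return found


sample = [
    "*y0u,0,2,0.400000005960464,02.03.2005",  # entry quoted in the thread
    "*Y0U,0,5,0.9,02.03.2005",                # hypothetical entry
    "*you,0,40,0.3,02.03.2005",               # hypothetical, different token
]
variants = case_variants(sample, "y0u")
print(variants, "total:", sum(variants.values()))
```

If the total across all case variants is still far below the keyword filter's hit count, case sensitivity alone does not explain the difference.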
mikek
Senior Member
Joined: 22 February 2005 Location: Switzerland Status: Offline Points: 133 |
Posted: 03 March 2005 at 2:58am |
|
There are no other case variants of "y0u" in the corpus db... My statistics show that no mail has been rejected by the Bayesian filter so far.

Statistics from the current quarantine (kept for 3 days; still running as a test with only a few domains, so the numbers aren't that high, but it has been running for several months now):

895 IP found in MAPS search