Bayesian Filter
mikek, Senior Member. Joined: 22 February 2005. Location: Switzerland. Status: Offline. Points: 133
Posted: 02 March 2005 at 6:07am
I suspect something fishy with the Bayesian filter. I have had it running for quite some time now (Spam=26824, Good=15002, from corpus.ini), but the counts for common tokens seem very low. For example, I have the keyword "y0u" in my keywords list. 11 e-mails were rejected by that keyword in my current quarantine DB (which keeps e-mail for 3 days only), but the corpus entry for the same token shows a value of only 2:

*y0u,0,2,0.400000005960464,02.03.2005

My tokens with the biggest values are all strange encoded values (probably from virus attachments), like:

*PC9saT4gPGxpIGNsYXNzPXNtYWxsPj,0,816,0.999899983406067,27.02.2005

Is this a bug or is there a rational explanation? All e-mails pass the Bayesian filter with 0% spam probability; I have never seen any other number in the activity log.

I would also like to see a function to mark spam that has passed through the filter, to further fine-tune it.

Version 2.1.2.410
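
One quick way to check where a token like the long one in the post above comes from is to try decoding it as Base64. This is only a sketch: trimming the token to a multiple of four characters is an arbitrary choice made so that the fragment decodes without padding.

```python
import base64

# Token copied from the corpus line quoted in the post above.
token = "PC9saT4gPGxpIGNsYXNzPXNtYWxsPj"

# Base64 decodes in groups of 4 characters; drop the incomplete tail group.
usable = token[: len(token) // 4 * 4]
decoded = base64.b64decode(usable)
print(decoded)  # b'</li> <li class=small'
```

If that reading is right, the token is a fragment of a Base64-encoded HTML body rather than a virus attachment, which would suggest the filter is tokenizing the raw encoded text of some messages.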
Cire, Newbie. Joined: 24 February 2005. Status: Offline. Points: 8
I am seeing similar results.
LogSat, Admin Group. Joined: 25 January 2005. Location: United States. Status: Offline. Points: 4105
The Bayesian filter is case sensitive, so y0u, Y0u, y0U and Y0U are all
statistically different tokens, while the keyword filter is not case
sensitive. So the 11 hits in the keyword filter would also include the
alternate-case spellings. Please check the corpus file to see if the
other case variations are present.
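
To illustrate the difference, here is a minimal sketch of a case-sensitive token count next to a case-insensitive keyword match. It is not SpamFilter's actual code; the sample text and the function names `token_counts` and `keyword_hits` are made up for the example.

```python
import re
from collections import Counter

def token_counts(text):
    """Case-sensitive token counts, the way a Bayesian corpus would store them."""
    return Counter(re.findall(r"\w+", text))

def keyword_hits(text, keyword):
    """Number of case-insensitive whole-word matches, as a keyword filter would see them."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# Hypothetical message body containing all four spellings once.
sample = "Y0U have won, y0u lucky winner. Y0u must reply, y0U know."

counts = token_counts(sample)
print(counts["y0u"], counts["Y0u"], counts["y0U"], counts["Y0U"])  # 1 1 1 1
print(keyword_hits(sample, "y0u"))                                 # 4
```

The keyword filter reports four hits, while the corpus would hold four separate tokens with a count of one each.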
As for the "odd" token, spammers are known to add gibberish to emails, so it is normal to see such entries. What matters is that when that "word" is found in an email, there is a 99.989998% chance it is spam according to the token, so it will help in categorizing the email as spam.

Can you please post the statistics for the emails blocked by the various filters from the "Statistics" tab so we can get an idea of the number of emails blocked?
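
For reference, the two corpus lines quoted in the first post are consistent with the per-token formula from Paul Graham's "A Plan for Spam", although nothing in this thread confirms that SpamFilter uses it. The sketch below assumes the corpus fields are token, good count, spam count, probability, date. The minimum count of 5 and the neutral value 0.4 are taken from Graham's essay; the 0.9999 cap is a guess chosen to match the value shown in the corpus.

```python
def token_probability(good, bad, n_good, n_bad,
                      min_count=5, neutral=0.4, cap=0.9999):
    """Graham-style spam probability for one token (assumed, not SpamFilter's code)."""
    g = 2 * good  # Graham doubles good counts to bias against false positives
    if g + bad < min_count:
        return neutral  # too little data: fall back to the neutral value
    spam_freq = min(1.0, bad / n_bad)
    good_freq = min(1.0, g / n_good)
    p = spam_freq / (good_freq + spam_freq)
    return max(1.0 - cap, min(cap, p))  # clamp away from 0 and 1

N_SPAM, N_GOOD = 26824, 15002  # totals reported from corpus.ini

print(token_probability(0, 2, N_GOOD, N_SPAM))    # 0.4
print(token_probability(0, 816, N_GOOD, N_SPAM))  # 0.9999
```

Under these assumptions, a token seen only twice keeps the neutral 0.4 regardless of which mails it appeared in, which would account for the probability on the y0u line. It would not account for a spam count of only 2 when the keyword filter reports 11 hits.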
mikek, Senior Member. Joined: 22 February 2005. Location: Switzerland. Status: Offline. Points: 133
There are no other case variants of "y0u" in the corpus DB...

My statistics show that no mail has ever been rejected by the Bayesian filter so far.

Statistics from the current quarantine (kept for 3 days; still running as a test with only a few domains, so the numbers aren't that high, but it has been running for several months now):

895 IP found in MAPS search