Spam Filter ISP Support Forum


Bayesian Filter Rejecting Good Mail

StevenJohns
Senior Member
Joined: 03 August 2006
Posted: 08 November 2007 at 4:08am
Hi Roberto,
 
Recently I have seen that the Bayesian filter appears to be rejecting a lot of emails that are obviously not spam.
I have the filter set to 99.5%, but it seems to be giving a score of 100% for these emails.
I have PM'ed you one of the offending emails and the results of the corpus after I pasted this email into the probability page.
I think this may be due to a corrupt Bayesian database, but I'm not sure.
I am running 3.5.4.718 in standard ISP mode (not enterprise).
 
Cheers
 
 
 

LogSat
Admin Group
Joined: 25 January 2005
Location: United States
Posted: 08 November 2007 at 10:18pm
StevenJohns,

We've been examining the debug output in one of the PMs, and while it does not look like the Bayesian filter is corrupted, it does appear that it has lost accuracy. The filter "learns as you go", but if users do not occasionally force the delivery of good emails that were blocked by mistake, the Bayesian filter could mis-learn and become less accurate.

I would recommend resetting the Bayesian database so that you start with a fresh statistical database. To do so, please stop SpamFilter, delete or rename the \SpamFilter\corpus directory, then restart SpamFilter.
Roberto Franceschetti

LogSat Software

Spam Filter ISP
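
The rename step described above could be scripted roughly as follows. This is only a minimal sketch: the function name and the backup naming scheme are made up for illustration, the install path is whatever applies on your server, and stopping and restarting SpamFilter is still done by hand before and after running it.

```python
import time
from pathlib import Path


def set_aside_corpus(install_dir: str) -> Path:
    """Rename <install_dir>/corpus to corpus.bak-<timestamp> and return the new path.

    Run this only while SpamFilter is stopped. The old corpus is kept as a
    backup rather than deleted, so it can be restored if needed.
    """
    corpus = Path(install_dir) / "corpus"
    if not corpus.is_dir():
        raise FileNotFoundError(f"no corpus directory at {corpus}")
    # Hypothetical naming scheme: any name other than "corpus" would do.
    backup = corpus.with_name("corpus.bak-" + time.strftime("%Y%m%d-%H%M%S"))
    corpus.rename(backup)
    return backup
```

Renaming rather than deleting keeps the old database around in case it turns out not to be the cause of the problem.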

StevenJohns
Posted: 09 November 2007 at 3:57am
Thought you would say that... I did rename the corpus folder and kept the original database just in case it was corrupt. The new DB is working fine now. How long before it kicks into action? That is, how many emails does it take before it starts working again?
Also, is there any way that I can force-train the database? I have several hundred thousand spam and ham emails which I would like to use to train the Bayesian filter, in the same way that you can manually train SpamAssassin.
 
Cheers

LogSat
Posted: 10 November 2007 at 4:28pm
By default, the Bayesian filter starts to block emails after receiving 5,000 spam and 5,000 good emails.
The only way to "train" the database is by forcing the delivery of emails that were blocked by mistake. This teaches the filter which emails were valid. The opposite (teaching it which emails were spam) is not currently possible. This is because after an email is delivered, depending on the email client used, it may be impossible for the end-user to have access to the original, unmodified source of the email (Microsoft Outlook for example *completely* alters the email's source). The only solution would be for SpamFilter to retain a copy of the original emails that were delivered, and that is something we've been looking at for a while, but still have not decided if it is a good idea or not.
Roberto Franceschetti

LogSat Software

Spam Filter ISP
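
To make the 5,000 + 5,000 threshold and the effect of a forced delivery more concrete, here is a toy model in Python. It is a generic naive Bayes sketch and not SpamFilter's actual algorithm or file format, neither of which is described in this thread: the class name, the per-token clamping to 0.01..0.99 and the combining rule are all assumptions made for illustration.

```python
import math
from collections import Counter


class ToyCorpus:
    """Toy token-count corpus; NOT SpamFilter's real data structure."""

    def __init__(self, min_each=5000):
        self.min_each = min_each  # messages of each class needed before blocking
        self.spam_msgs = 0
        self.ham_msgs = 0
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()

    def learn(self, tokens, is_spam):
        """Count each distinct token once per message."""
        if is_spam:
            self.spam_msgs += 1
            self.spam_tokens.update(set(tokens))
        else:
            self.ham_msgs += 1
            self.ham_tokens.update(set(tokens))

    def ready(self):
        """True once enough spam AND enough good mail has been seen."""
        return self.spam_msgs >= self.min_each and self.ham_msgs >= self.min_each

    def token_probability(self, token):
        """Estimate P(spam | token), clamped to [0.01, 0.99]; 0.5 if unseen."""
        s = self.spam_tokens[token] / max(self.spam_msgs, 1)
        h = self.ham_tokens[token] / max(self.ham_msgs, 1)
        if s + h == 0:
            return 0.5
        return min(0.99, max(0.01, s / (s + h)))

    def spam_probability(self, tokens):
        """Combine per-token estimates via summed log-odds (naive Bayes)."""
        log_odds = 0.0
        for token in set(tokens):
            p = self.token_probability(token)
            log_odds += math.log(p) - math.log(1.0 - p)
        log_odds = max(-700.0, min(700.0, log_odds))  # keep math.exp in range
        return 1.0 / (1.0 + math.exp(-log_odds))
```

In this toy model, a word that has only ever been counted as spam pins its estimate at the upper clamp, and a handful of such words is enough to push a message's combined score to effectively 100%. Calling `learn(tokens, is_spam=False)` for a wrongly blocked message, which is the toy equivalent of a forced delivery, is what pulls those estimates back down.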

StevenJohns
Posted: 10 November 2007 at 4:42pm
We actually use SF in tag-and-deliver mode, forwarding all emails to an internal server for further processing. We therefore keep ALL emails, both ham and spam, and so do have access to the original emails. Knowing this, is there any way that I can retrain the corpus? We have several hundred thousand emails of each kind, which have proved to be a good corpus for us. Wiping that out because the corpus got corrupt is going to be a HUGE pain for us.
 
 

LogSat
Posted: 10 November 2007 at 5:10pm
If you do have the *original* emails, and know which is spam and which is not, you could in theory re-send all the emails back to SpamFilter (we'd suggest installing a separate copy of SpamFilter somewhere else, with a "fake" destination SMTP server so emails won't be delivered again, and starting with a blank Bayesian database). You could then first send all the "good" emails, possibly whitelisting the sender's IP address so as to tell SpamFilter they are "clean". You could then send all the spam ones, this time blacklisting the IP to tell SpamFilter they are spam.
We'll need to double-check to ensure that manually blacklisting and manually whitelisting IPs will still cause SpamFilter to "teach" the Bayesian filter about these emails, as there's a chance that this manual intervention may cause emails to be skipped by the learning process.
Roberto Franceschetti

LogSat Software

Spam Filter ISP
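
One way the re-send could be scripted is sketched below with Python's standard `smtplib`. It assumes the saved messages sit on disk as one raw `.eml` file per message, with ham and spam in separate directories. The function names, the file layout and the envelope addresses are all hypothetical, and nothing here is a SpamFilter API: the script simply speaks SMTP to whatever host it is pointed at.

```python
import smtplib
from pathlib import Path


def replay_directory(directory, send):
    """Pass the raw bytes of every *.eml file in `directory` to `send`, in name order.

    Returns the number of messages handed to `send`.
    """
    count = 0
    for path in sorted(Path(directory).glob("*.eml")):
        send(path.read_bytes())  # raw bytes, so the original source is not altered
        count += 1
    return count


def make_smtp_sender(host, port, mail_from, rcpt_to):
    """Return a send(raw_bytes) callable that opens one SMTP connection per message."""
    def send(raw_message):
        with smtplib.SMTP(host, port) as smtp:
            # sendmail() accepts bytes and transmits them as-is
            smtp.sendmail(mail_from, [rcpt_to], raw_message)
    return send
```

A run would then be two passes against the separate training instance: `replay_directory("ham", make_smtp_sender(...))` while the sending machine's IP is whitelisted, followed by the same call on the spam directory once that IP has been moved to the blacklist. One connection per message keeps the sketch simple, but it will be slow for several hundred thousand emails.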

StevenJohns
Posted: 10 November 2007 at 6:06pm
OK, if you can check whether the manual intervention will cause an issue, I will set up another server as suggested.
 
Cheers

LogSat
Posted: 12 November 2007 at 8:53am
We confirm that in both cases the emails will be passed through the Bayesian learning process. For the case where the sender's IP is blacklisted, however, you will need to ensure that the emails are quarantined. Only if they are quarantined will the emails be forwarded to the "learning" process. The quarantine database must therefore be enabled, and the "do not quarantine" option for the blacklisted IP filters must not be checked. Please note that the quarantine database can be a different database from your "real" production data if you are performing this on a separate instance of SpamFilter.
Roberto Franceschetti

LogSat Software

Spam Filter ISP

StevenJohns
Posted: 12 November 2007 at 9:01am
Roberto,
Can you please clarify the following statement for me..."Only if they are quarantined will the emails be forwarded to the "learning" process."
 
Our normal setup is that we do NOT use the SF quarantine database; instead we tag the emails and pass them on to another server for further processing and use our own quarantine. If I read your last statement correctly, does this mean our emails have never been learnt as spam?
 
 

LogSat
Posted: 12 November 2007 at 9:15am
My fault. I should have said "quarantined or delivered". Only in those cases does the bayesian filter actually see the contents of the email and will thus analyze them. "Tagging" does cause them to be delivered so yes, they will be going thru the learning process.
Roberto Franceschetti

LogSat Software

Spam Filter ISP

StevenJohns
Posted: 12 November 2007 at 9:20am
Ok cool....you had me worried there for a minute !!!
 
Will start the process shortly and let you know how it gets on.
 

WebGuyz
Senior Member
Joined: 09 May 2005
Location: United States
Posted: 12 November 2007 at 10:13am
Originally posted by LogSat:

.... re-send all the emails back to SpamFilter (we'd suggest installing a separate copy of SpamFilter somewhere else...
 
The bayes is server specific. If you set up a second copy, that second server will have the correct bayes info but the primary one will not. Too bad there is no way to share bayes info amongst multiple SFE's.
 
Still think greylisting is the best primary defense even if it is slower amongst multiple SFE's. We'll just get a faster box ;-)
http://www.webguyz.net

StevenJohns
Posted: 12 November 2007 at 10:44am
Roberto,
Can you confirm that the bayes files from one instance of SF on machine A will work if I copy them over to another instance of SF on machine B ??
 
 

LogSat
Posted: 12 November 2007 at 2:40pm
If you have two "live" servers running SpamFilter and receiving emails in real-time, for example a primary MX and a secondary MX server, WebGuyz is correct. It's not that they "won't work", but rather it has to do with statistics. Most legitimate mail servers will send emails to your primary MX server only. Spammers will send spam to both servers. This means that, statistically, the emails received by your primary MX server will be *very* different than the emails received by the secondary MX server. For this reason, copying the Bayesian statistical database (which is built by examining the types of emails received, and marking incoming emails by comparing them to the "average" emails being received) between those two servers will often result in completely incorrect results.

In your case, you are re-submitting emails that have already been received by your single, live server, and are allowing the Bayesian filter to re-process them. There are bound to be some inaccuracies, for example the fact that the Bayesian filter keeps track of the time the various words were received, while if you submit them all at once this timestamp will be inaccurate. However, the timestamp is used to "age" old words that are no longer being received, and to eventually remove them from the database. We don't think this will cause huge inaccuracies if you submit them all at once rather than in the span of several days... But again, please note that the process you are performing has never been done before, and that we did recommend to start from scratch...!
Roberto Franceschetti

LogSat Software

Spam Filter ISP
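
The "aging" of words mentioned above can be pictured with a small sketch. It is purely illustrative: the thread does not say how SpamFilter stores its timestamps or how long the retention window is, so the dictionary layout, the function name and the `max_age` parameter are all assumptions.

```python
from datetime import datetime, timedelta


def prune_stale_tokens(last_seen, now, max_age):
    """Return a copy of `last_seen` without tokens not seen within `max_age` of `now`.

    `last_seen` maps each token to the datetime it was most recently received.
    """
    cutoff = now - max_age
    return {token: seen for token, seen in last_seen.items() if seen >= cutoff}
```

With a bulk re-submission every word gets roughly the same "last seen" time, so under a scheme like this no word would expire early. Words that stop appearing in real traffic would instead all age out together, one retention window after the re-submission.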

StevenJohns
Posted: 12 November 2007 at 3:06pm
OK I see now.
 
I will perform the retrain possibly tomorrow and will let you know how it goes. If it does not work, then as you say, I can always delete the corpus and start from scratch, but I'd rather have a go at this first... just in case it does actually work.
 
Cheers