
Bayesian Filter Rejecting Good Mail

Printed From: LogSat Software
Category: Spam Filter ISP
Forum Name: Spam Filter ISP Support
Forum Description: General support for Spam Filter ISP
URL: http://www.logsat.com/spamfilter/forums/forum_posts.asp?TID=6288
Printed Date: 12 December 2017 at 4:59pm


Topic: Bayesian Filter Rejecting Good Mail
Posted By: StevenJohns
Subject: Bayesian Filter Rejecting Good Mail
Date Posted: 08 November 2007 at 4:08am
Hi Roberto,
 
Recently I have seen that the Bayesian filter appears to be rejecting a lot of emails that are obviously not spam.
I have the filter set to 99.5%, but it seems to be giving these emails a score of 100%.
I have PM'ed you one of the offending emails and the corpus results shown after I pasted this email into the probability page.
I think this may be due to a corrupt Bayesian database, but I'm not sure.
I am running 3.5.4.718 in standard ISP mode (not Enterprise).
 
Cheers
 
 
 



Replies:
Posted By: LogSat
Date Posted: 08 November 2007 at 10:18pm
StevenJohns,

We've been examining the debug output in one of the PMs, and while it does not look like the Bayesian filter is corrupted, it does appear that it has lost accuracy. The filter "learns as you go", but if users do not occasionally force the delivery of good emails that were blocked by mistake, the Bayesian filter can mis-learn and become less accurate.

I would recommend resetting the Bayesian database so that you start with a fresh statistical database. To do so, stop SpamFilter, delete or rename the \SpamFilter\corpus directory, then restart SpamFilter.
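As a rough illustration of the rename step, the sketch below moves the corpus directory aside under a timestamped name so the old database can be kept for reference. It assumes SpamFilter has already been stopped; the function name, the backup naming scheme, and the install path in the usage note are hypothetical and not part of SpamFilter itself.

```python
from datetime import datetime
from pathlib import Path


def archive_corpus(spamfilter_dir, now=None):
    """Rename <spamfilter_dir>/corpus to corpus.bak-YYYYmmdd-HHMMSS.

    Returns the backup path, or None if no corpus directory exists.
    Run this only while SpamFilter is stopped.
    """
    corpus = Path(spamfilter_dir) / "corpus"
    if not corpus.is_dir():
        return None  # nothing to archive
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    backup = corpus.with_name(f"corpus.bak-{stamp}")
    corpus.rename(backup)  # the old database stays on disk under the new name
    return backup
```

It could be called as `archive_corpus(r"C:\SpamFilter")`, where the path is only an example. The thread states that the new database was working after the restart, so the script does not create a new corpus directory itself.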


-------------
Roberto Franceschetti

http://www.logsat.com - LogSat Software

http://www.logsat.com/sfi-spam-filter.asp - Spam Filter ISP


Posted By: StevenJohns
Date Posted: 09 November 2007 at 3:57am
Thought you would say that... I did rename the corpus folder and keep the original database, just in case it was corrupt. The new DB is working fine now. How long before it kicks into action? I.e. how many emails does it take before it starts working again?
Also, is there any way that I can force-train the database? I have several hundred thousand spam and ham emails which I would like to use to train the Bayesian filter, in the same way that you can manually train SpamAssassin.
 
Cheers


Posted By: LogSat
Date Posted: 10 November 2007 at 4:28pm
By default, the Bayesian filter starts to block emails after receiving 5,000 spam and 5,000 good emails.
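To turn that threshold into a rough waiting time, the sketch below estimates how many days pass before both counts reach 5,000. It assumes steady daily volumes and that every received message counts toward the totals; the function name and the sample volumes are hypothetical.

```python
import math


def days_until_active(spam_per_day, ham_per_day, minimum=5000):
    """Days until both the spam and good-mail counts reach `minimum`."""
    if spam_per_day <= 0 or ham_per_day <= 0:
        raise ValueError("daily volumes must be positive")
    # The slower of the two streams decides when the threshold is met.
    slowest = min(spam_per_day, ham_per_day)
    return math.ceil(minimum / slowest)
```

For example, with 20,000 spam and 1,500 good emails per day, `days_until_active(20000, 1500)` returns 4, because the good-mail count is the one that takes longest to reach 5,000.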
The only way to "train" the database is by forcing the delivery of emails that were blocked by mistake. This teaches the filter which emails were valid. The opposite (teaching it which emails were spam) is not currently possible. This is because after an email is delivered, depending on the email client used, it may be impossible for the end user to access the original, unmodified source of the email (Microsoft Outlook, for example, *completely* alters the email's source). The only solution would be for SpamFilter to retain a copy of the original emails that were delivered. That is something we've been looking at for a while, but we still have not decided whether it is a good idea.




Posted By: StevenJohns
Date Posted: 10 November 2007 at 4:42pm
We actually use SF in tag-and-deliver mode, forwarding all emails to an internal server for further processing. We therefore keep ALL emails, both ham and spam, and so do have access to the original emails. Knowing this, is there any way that I can retrain the corpus? We have several hundred thousand emails of each kind, which have proved to be a good corpus for us. Wiping that out because the corpus got corrupt is going to be a HUGE pain for us.
 
 


Posted By: LogSat
Date Posted: 10 November 2007 at 5:10pm
If you do have the *original* emails, and know which is spam and which is not, you could in theory re-send all the emails back to SpamFilter (we'd suggest installing a separate copy of SpamFilter somewhere else, with a "fake" destination SMTP server so emails won't be delivered again, and starting with a blank bayesian database). You could then first send all the "good" emails, possibly whitelisting the sender's IP address so as to tell SpamFilter they are "clean". You could then send all the spam ones, this time blacklisting the IP to tell SpamFilter they are spam.
We'll need to double-check to ensure that manually blacklisting and manually whitelisting IPs will still cause SpamFilter to "teach" the bayesian filter about these emails, as there's a chance that this manual intervention may cause emails to be skipped by the learning process.
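The re-send step could be scripted along the following lines. This is only a sketch, assuming the stored originals are available as one raw `.eml` file per message; the function names, the host, and the envelope addresses are hypothetical. SpamFilter classifies by the connecting IP address, so the whitelisting or blacklisting of the sending machine still has to be set up in SpamFilter before each batch is sent.

```python
import smtplib
from pathlib import Path


def iter_messages(directory):
    """Yield (path, raw_bytes) for every .eml file, in name order."""
    for path in sorted(Path(directory).glob("*.eml")):
        yield path, path.read_bytes()


def replay(directory, send):
    """Pass each stored message to send(raw_bytes).

    Returns (sent_count, failures), where failures is a list of
    (file_name, error_text) for messages that could not be sent.
    """
    sent = 0
    failed = []
    for path, raw in iter_messages(directory):
        try:
            send(raw)
            sent += 1
        except (smtplib.SMTPException, OSError) as exc:
            failed.append((path.name, str(exc)))
    return sent, failed


def smtp_sender(host, port, mail_from, rcpt_to):
    """Return a send(raw_bytes) callable using one SMTP connection per message."""
    def send(raw):
        with smtplib.SMTP(host, port, timeout=30) as smtp:
            # Raw bytes are sent unmodified, preserving the original source.
            smtp.sendmail(mail_from, [rcpt_to], raw)
    return send
```

A run for the good emails might look like `replay("ham", smtp_sender("sf-training.example", 25, "replay@example.com", "user@example.com"))`, followed by a second run for the spam directory once the sending IP has been moved from the whitelist to the blacklist. Keeping the sending function separate from the loop makes it possible to dry-run the script with a dummy sender first.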




Posted By: StevenJohns
Date Posted: 10 November 2007 at 6:06pm
OK, if you can check whether the manual intervention will cause an issue, I will set up another server as suggested.
 
Cheers


Posted By: LogSat
Date Posted: 12 November 2007 at 8:53am
We confirm that in both cases the emails will be passed through the Bayesian learning process. For the case where the sender's IP is blacklisted, however, you will need to ensure that the emails are quarantined. Only if they are quarantined will the emails be forwarded to the "learning" process. The quarantine database must therefore be enabled, and the "do not quarantine" option for the blacklisted IP filters must not be checked. Please note that the quarantine database can be a different database from your "real" production one if you are performing this on a separate instance of SpamFilter.



Posted By: StevenJohns
Date Posted: 12 November 2007 at 9:01am
Roberto,
Can you please clarify the following statement for me..."Only if they are quarantined will the emails be forwarded to the "learning" process."
 
Our normal setup is that we do NOT use the SF quarantine database; instead we tag the emails and pass them on to another server for further processing, and use our own quarantine. If I read your last statement correctly, does this mean our emails have never been learnt as spam?
 
 


Posted By: LogSat
Date Posted: 12 November 2007 at 9:15am
My fault. I should have said "quarantined or delivered". Only in those cases does the bayesian filter actually see the contents of the email and will thus analyze them. "Tagging" does cause them to be delivered so yes, they will be going thru the learning process.



Posted By: StevenJohns
Date Posted: 12 November 2007 at 9:20am
Ok cool....you had me worried there for a minute !!!
 
Will start the process shortly and let you know how it gets on.
 


Posted By: WebGuyz
Date Posted: 12 November 2007 at 10:13am
Originally posted by LogSat:

.... re-send all the emails back to SpamFilter (we'd suggest installing a separate copy of SpamFilter somewhere else...
 
The bayes is server specific. If you set up a second copy, then that second server will have the correct bayes info but the primary one will not. Too bad there is no way to share bayes info amongst multiple SFEs.
 
Still think greylisting is the best primary defense, even if it is slower amongst multiple SFEs. We'll just get a faster box ;-)


-------------
http://www.webguyz.net


Posted By: StevenJohns
Date Posted: 12 November 2007 at 10:44am
Roberto,
Can you confirm that the bayes files from one instance of SF on machine A will work if I copy them over to another instance of SF on machine B ??
 
 


Posted By: LogSat
Date Posted: 12 November 2007 at 2:40pm
If you have two "live" servers running SpamFilter and receiving emails in real-time, for example a primary MX and a secondary MX server, WebGuyz is correct. It's not that they "won't work", but rather it has to do with statistics. Most legitimate mail servers will send emails to your primary MX server only. Spammers will send spam to both servers. This means that, statistically, the emails received by your primary MX server will be *very* different from the emails received by the secondary MX server. For this reason, copying the Bayesian statistical database (which is built by examining the types of emails received, and marking incoming emails by comparing them to the "average" emails being received) between those two servers will often result in completely incorrect results.
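To make the statistical point concrete, here is a generic naive-Bayes-style sketch. It is not SpamFilter's actual formula, which this thread does not describe; the function names, the clamping bounds, and all the counts in the note below are hypothetical.

```python
def token_spam_probability(spam_hits, ham_hits, spam_total, ham_total):
    """Estimate how spammy a word is from per-class message counts."""
    spam_rate = spam_hits / spam_total if spam_total else 0.0
    ham_rate = ham_hits / ham_total if ham_total else 0.0
    if spam_rate + ham_rate == 0:
        return 0.5  # word never seen: no evidence either way
    p = spam_rate / (spam_rate + ham_rate)
    # Clamp so a single word can never be treated as absolute proof.
    return min(0.99, max(0.01, p))


def combined_probability(probs):
    """Combine per-word probabilities, assuming the words are independent."""
    spam = 1.0
    ham = 1.0
    for p in probs:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)
```

Suppose a word appears in 50 of 5,000 spam and 400 of 5,000 good emails on the primary MX: its probability is about 0.11. If the secondary MX sees it in 90 of 9,000 spam but only 2 of 500 good emails, the same word scores about 0.71. A database built on one server therefore encodes assumptions that do not hold on the other. The second function also shows how a few strongly spammy words can push a score past a 99.5% threshold: three words at 0.99 combine to more than 0.9999.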

In your case, you are re-submitting emails that have already been received by your single, live server, and are allowing the Bayesian filter to re-process them. There are bound to be some inaccuracies. For example, the Bayesian filter keeps track of the time the various words were received, so if you submit them all at once this timestamp will be inaccurate. However, the timestamp is used to "age" old words that are no longer being received, and to eventually remove them from the database. We don't think this will cause huge inaccuracies if you submit them all at once rather than over the span of several days... But again, please note that the process you are performing has never been done before, and that we did recommend starting from scratch...!
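The aging idea can be pictured with the sketch below. It is an illustration only: the data structure, the function name, and the retention window are assumptions, not SpamFilter's real storage format or settings.

```python
from datetime import datetime, timedelta


def prune_stale_tokens(last_seen, now, max_age):
    """Return a copy of last_seen without words older than max_age.

    last_seen maps each word to the datetime it was last received.
    """
    cutoff = now - max_age
    return {word: ts for word, ts in last_seen.items() if ts >= cutoff}
```

When a whole archive is re-submitted in one batch, every word ends up with roughly the same "last seen" time, so words that had in fact gone out of use would survive until the full (hypothetical) retention window has passed after the retraining run.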




Posted By: StevenJohns
Date Posted: 12 November 2007 at 3:06pm
OK I see now.
 
I will perform the retrain, possibly tomorrow, and will let you know how it goes. If it does indeed not work then, as you say, I can always delete the corpus and start from scratch, but I'd rather have a go at this first... just in case it does actually work.
 
Cheers


