
corpus database

Printed From: LogSat Software
Category: Spam Filter ISP
Forum Name: Spam Filter ISP Support
Forum Description: General support for Spam Filter ISP
URL: https://www.logsat.com/spamfilter/forums/forum_posts.asp?TID=2698
Printed Date: 01 May 2025 at 9:21pm


Topic: corpus database
Posted By: kspare
Subject: corpus database
Date Posted: 14 January 2004 at 9:13pm

Hi Roberto, I know this has been brought up before, but are there still any plans to put the corpus database into the actual database of choice? I ask because I am now running redundant servers. If I were to keep all traffic on one server, it would maximize the effectiveness of the filter. But if that server were to go down and traffic came across the second spam server, the filter would not work nearly as well. It would be nice to see this implemented, if possible.

Kevin




Replies:
Posted By: Guests
Date Posted: 15 January 2004 at 10:24pm
Why not just periodically copy the corpus database over to the other server?


Posted By: LogSat
Date Posted: 15 January 2004 at 10:44pm

Kevin,

We had actually tried implementing a fully database-based corpus during our alpha development. That would have been a preferred solution for us as well, but performance testing showed HUGE problems. Neither MS SQL nor MySQL was even close to being able to handle the massive quantity of queries needed to scan and update the corpus.

Regarding the issue of having two separate corpus databases, we actually encourage that, since separate servers, especially in a primary/secondary MX situation, receive different emails and thus the statistics are different. But you do bring up a very valid point: if the backup never receives traffic except when the primary goes down, its corpus would be less effective.

An option would be to not have the backup write to and update the corpus database, but to have it only reload the corpus from file at regular intervals. Basically it would just be a read-only copy. The administrator would have to use a scheduled file replication to copy it from the primary server. This is something we can add as an option in the ini file rather easily; we'll see if we can have it done before the beta period is over.
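
[Editor's sketch] The scheduled replication described above could look roughly like the following Python script, run from a scheduler on the backup server. This is only an illustration under stated assumptions: the file names and paths are hypothetical, the read-only ini option is not shown because it does not exist yet, and nothing here reflects how SpamFilter itself reads the file. The copy goes to a temporary file first and is then swapped in, so a reader never sees a half-written corpus.

```python
import os
import shutil
import tempfile


def replicate_corpus(src_path: str, dst_path: str) -> bool:
    """Copy the primary's corpus file over the backup's copy.

    Returns True if a copy was made, False if the source is missing.
    Both paths are hypothetical; adjust them to the real installation.
    """
    if not os.path.isfile(src_path):
        return False

    # Write to a temporary file in the destination directory first, so the
    # final rename stays on the same filesystem.
    dst_dir = os.path.dirname(os.path.abspath(dst_path))
    fd, tmp_path = tempfile.mkstemp(dir=dst_dir, suffix=".tmp")
    os.close(fd)
    try:
        shutil.copy2(src_path, tmp_path)
        # Swap the finished copy into place in a single step.
        os.replace(tmp_path, dst_path)
    except BaseException:
        # Clean up the partial copy before re-raising the error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

On Windows the final swap can fail if another process holds the destination file open, so a real job would probably need to retry. Scheduling (for example every 10 minutes) is left to the operating system's task scheduler.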

Roberto F.
LogSat Software



Posted By: LogSat
Date Posted: 15 January 2004 at 10:47pm

That can't currently be done because SpamFilter updates the file every 10 minutes with the new tokens it "learned", thus overwriting any changes that may have been copied over. The file is loaded back into memory only right after SpamFilter updates it.

Please see my next posting in this thread for more info though, as there may be a solution for this.

Roberto F.
LogSat Software



Posted By: Guests
Date Posted: 16 January 2004 at 12:00am

That is a good possibility. I currently have both servers run a script to copy down all the white/black lists, so that is just one extra task. Easily done.

There has to be some sort of solution for the servers to share the corpus database in real time, though?

What if you could specify a path for the corpus data? If that could be done, could both servers share the database?

 



Posted By: LogSat
Date Posted: 16 January 2004 at 7:53am

That's not as simple as it would seem, since both SpamFilters would be trying to write to the same files. These operations need to occur extremely fast, and they were the cause of the previous memory leaks and performance issues we were experiencing. Adding routines to prevent locking/sharing conflicts between multiple instances of SpamFilter would greatly affect performance, and we do not want to do that just now. After this new version is stable and reliable, it will be something we'll take a look at again.
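
[Editor's sketch] To illustrate the kind of coordination two writers on one shared file would need, here is a minimal lock-file example in Python. It is an assumption-laden sketch, not SpamFilter's code: the `.lock` naming, the timeout values, and the one-token-per-line file format are all hypothetical. Every update has to acquire the lock first and may have to wait for the other server, which is the overhead mentioned above.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path, timeout=5.0, poll=0.01):
    """Hold an exclusive lock by creating lock_path; wait if it already exists."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if another writer holds the lock.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError("could not acquire " + lock_path)
            time.sleep(poll)  # every wait here delays mail processing
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


def append_tokens(corpus_path, tokens):
    """Append learned tokens to a shared file while holding the lock."""
    with exclusive_lock(corpus_path + ".lock"):
        with open(corpus_path, "a", encoding="utf-8") as f:
            for token in tokens:
                f.write(token + "\n")
```

A crashed writer would leave a stale lock file behind, and exclusive creation is not always reliable on network shares, so a production version would need considerably more care than this.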

Roberto F.
LogSat Software

<<
What if you could specify a path for the corpus data, if that could be done, could both servers share the database?
>>



Posted By: kspare
Date Posted: 16 January 2004 at 8:20am

Fair Enough.



Posted By: Guests
Date Posted: 17 January 2004 at 8:31pm

This sounds like a good option.

As a variation, would it be possible to have a primary database which behaves as normal, and a second file which is a read-only copy?

This way clustered mail servers could provide their primary files as a secondary, read-only file for each other...



