Spam Filter ISP Support Forum


Pruning the corpus?

Alan (Guest Group)
Posted: 09 March 2004 at 6:39pm

Taking a look at the corpus dump, I see it has quite a few entries made up of random character strings. These are likely the random strings that some spammers now put in their emails.

Would it be practical to come up with a routine to systematically root out and delete some of these random strings so they don't eventually bog down the corpus?  Maybe just the entries with 11 or more random characters that have only one occurrence, and none in the past 30 days.  It could be set up so that admins who want to can run it once in a while as part of their typical maintenance, like compacting a database.

I don't know if this is a practical idea, but it's just a thought and I figured it was worth a mention.
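The rule proposed above could be sketched roughly as follows. This is only an illustration: the actual layout of the corpus file is not described in this thread, so the in-memory dictionary mapping each token to a (count, last seen) pair, the function name prune_rare_long_tokens, and its parameters are all hypothetical. The sketch also uses token length alone as a stand-in for "random", and reads the rule as "seen exactly once, and not within the past 30 days".

```python
from datetime import datetime, timedelta


def prune_rare_long_tokens(corpus, now, min_length=11, max_age_days=30):
    """Return a copy of `corpus` without long, one-off, old tokens.

    `corpus` is a hypothetical dict: token -> (count, last_seen datetime).
    A token is dropped only if ALL of the following hold:
      - it has at least `min_length` characters,
      - it has been seen exactly once,
      - that single sighting is older than `max_age_days`.
    """
    cutoff = now - timedelta(days=max_age_days)
    return {
        token: (count, last_seen)
        for token, (count, last_seen) in corpus.items()
        if not (len(token) >= min_length and count == 1 and last_seen < cutoff)
    }


# Example with made-up data
corpus = {
    "xkqjzpwvbnmty": (1, datetime(2004, 1, 1)),  # long, seen once, old
    "qwpzmxnvbtyrk": (1, datetime(2004, 3, 1)),  # long, seen once, recent
    "abcdefghijkl": (5, datetime(2004, 1, 1)),   # long, old, seen often
    "viagra": (1, datetime(2004, 1, 1)),         # short, seen once, old
}
pruned = prune_rare_long_tokens(corpus, now=datetime(2004, 3, 9))
```

With this data only the first token meets all three conditions, so it is the only one removed. A real routine would also need some test for randomness, since a length check by itself will catch legitimate long words.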

LogSat (Admin Group)
Joined: 25 January 2005
Location: United States
Posted: 10 March 2004 at 11:02pm

Alan,

Every 12 hours a built-in cleanup procedure prunes stale entries from the corpus db.dat file. A "stale" token is a word that has not appeared in an email in the past n days. The value of n is defined in the SpamFilter.ini file under:

;Remove any stale token in the corpus db.dat file that did not appear in incoming emails for the past n days
CleanUpCorpusIntervalDays=7

This helps keep the corpus at a manageable size.
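As a rough illustration of the stale-token rule described above, here is a minimal sketch. It is not SpamFilter's actual code: the db.dat format is not documented in this thread, so the dictionary of token to last-seen time and the function name prune_stale_tokens are assumptions. Only the 7-day default is taken from the CleanUpCorpusIntervalDays setting shown above.

```python
from datetime import datetime, timedelta


def prune_stale_tokens(last_seen_by_token, now, interval_days=7):
    """Keep only tokens seen within the last `interval_days` days.

    `last_seen_by_token` is a hypothetical dict: token -> datetime of the
    most recent email containing it. `interval_days` plays the role of
    CleanUpCorpusIntervalDays.
    """
    cutoff = now - timedelta(days=interval_days)
    return {
        token: seen
        for token, seen in last_seen_by_token.items()
        if seen >= cutoff
    }


# Example with made-up data
tokens = {
    "mortgage": datetime(2004, 3, 9),       # seen yesterday: kept
    "xkqjzpwvbnmty": datetime(2004, 2, 20),  # not seen for weeks: stale
}
fresh = prune_stale_tokens(tokens, now=datetime(2004, 3, 10))
```

Under this rule a random string that appears in a single email drops out on its own once n days pass without it reappearing, which is why a separate pass for random strings may not be needed.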

Roberto F.
LogSat Software
