Pruning the corpus?

Printed From: LogSat Software
Category: Spam Filter ISP
Forum Name: Spam Filter ISP Support
Forum Description: General support for Spam Filter ISP
URL: https://www.logsat.com/spamfilter/forums/forum_posts.asp?TID=3132
Printed Date: 09 May 2025 at 6:45pm


Topic: Pruning the corpus?
Posted By: Guests
Subject: Pruning the corpus?
Date Posted: 09 March 2004 at 6:39pm

Looking at the corpus dump, I see quite a few entries made up of random character strings. These are likely the random strings that some spammers now put in their emails.

Would it be practical to come up with a routine to systematically root out and delete some of these random strings so they don't eventually bog down the corpus?  Maybe just the entries with 11 or more random characters that have only one occurrence, and that occurrence more than 30 days ago.  It could be set up so that admins who want to can run it once in a while as part of their typical maintenance, like compacting a database.
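To make the idea concrete, here is a rough sketch in Python. It is only an illustration: the corpus layout (a dict mapping each token to its count and last-seen date) and the function name prune_random_tokens are made up for the example, and it does not try to detect randomness, it just treats any token of 11 or more characters as a candidate.

```python
from datetime import datetime, timedelta

def prune_random_tokens(corpus, now, min_length=11, max_age_days=30):
    """Return a copy of corpus without long, single-occurrence, old tokens.

    corpus maps token -> (count, last_seen). This layout is hypothetical
    and is not the real format of the corpus file.
    """
    cutoff = now - timedelta(days=max_age_days)
    kept = {}
    for token, (count, last_seen) in corpus.items():
        # A token is pruned only if all three conditions hold.
        is_candidate = (
            len(token) >= min_length
            and count == 1
            and last_seen < cutoff
        )
        if not is_candidate:
            kept[token] = (count, last_seen)
    return kept

now = datetime(2004, 3, 9)
corpus = {
    "xqzvkjwpltrb": (1, datetime(2004, 1, 5)),    # long, seen once, old: pruned
    "mortgage": (42, datetime(2004, 3, 8)),       # short and frequent: kept
    "unsubscribing": (1, datetime(2004, 3, 1)),   # long but recent: kept
    "international": (17, datetime(2004, 1, 2)),  # long and old but frequent: kept
}
pruned = prune_random_tokens(corpus, now)
print(sorted(pruned))  # ['international', 'mortgage', 'unsubscribing']
```

Requiring all three conditions keeps the routine conservative, so a long but legitimate word that shows up regularly is never touched.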

I don't know if this is a practical idea, but it was just a thought and I figured it was worth a mention.




Replies:
Posted By: LogSat
Date Posted: 10 March 2004 at 11:02pm

Alan,

Every 12 hours a built-in cleanup procedure prunes any stale entries from the corpus db.dat file. A "stale" entry is a word (token) that has not appeared in an email in the past n days. The value of "n" is defined in the SpamFilter.ini file under:

;Remove any stale token in the corpus db.dat file that did not appear in incoming emails for the past n days
CleanUpCorpusIntervalDays=7

This helps in keeping the corpus to a manageable size.
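For illustration only, the stale-token rule can be sketched in Python as below. This is not SpamFilter's actual code and says nothing about the real db.dat format: the dict of token to last-seen date and the function name remove_stale_tokens are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Mirrors CleanUpCorpusIntervalDays=7 from SpamFilter.ini.
CLEANUP_CORPUS_INTERVAL_DAYS = 7

def remove_stale_tokens(last_seen, now, interval_days=CLEANUP_CORPUS_INTERVAL_DAYS):
    """Keep only tokens seen within the past interval_days.

    last_seen maps token -> datetime of its most recent appearance
    (a hypothetical in-memory stand-in for the corpus file).
    """
    cutoff = now - timedelta(days=interval_days)
    return {token: seen for token, seen in last_seen.items() if seen >= cutoff}

now = datetime(2004, 3, 10)
last_seen = {
    "mortgage": datetime(2004, 3, 9),       # seen yesterday: kept
    "xqzvkjwpltrb": datetime(2004, 2, 20),  # not seen for 19 days: removed
}
print(remove_stale_tokens(last_seen, now))
```

With this rule a random string that appears in a single spam run drops out of the corpus on its own once n days pass without it being seen again.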

Roberto F.
LogSat Software



