Pruning the corpus?
Alan, Guest Group
Posted: 09 March 2004 at 6:39pm
Taking a look at the corpus dump, I see it has quite a few entries made up of random character strings, likely the random strings that some spammers now put in their emails. Would it be practical to come up with a routine that systematically roots out and deletes some of these strings so they don't eventually bog down the corpus? Maybe just the entries with 11 or more random characters that have only one occurrence, with that occurrence more than 30 days old. It could be set up so that admins who want to can run it once in a while as part of their routine maintenance, like compacting a database. I don't know if this is a practical idea, but I figured it was worth a mention.
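To make the rule concrete, here is a rough Python sketch of the kind of pass described above. It does not read the real corpus file: the `corpus` dict, the `(count, last_seen)` layout, and the function name `prune_long_rare_tokens` are all made up for illustration, and it makes no attempt to judge whether a string is actually "random". It only applies the length, count, and age thresholds.

```python
from datetime import datetime, timedelta

def prune_long_rare_tokens(corpus, now, min_length=11, max_count=1, min_age_days=30):
    """Return a copy of `corpus` without long, rarely seen, old tokens.

    `corpus` is a hypothetical in-memory mapping of
    token -> (occurrence count, datetime last seen).
    """
    cutoff = now - timedelta(days=min_age_days)
    return {
        token: (count, last_seen)
        for token, (count, last_seen) in corpus.items()
        # Drop a token only when all three conditions hold.
        if not (len(token) >= min_length and count <= max_count and last_seen < cutoff)
    }

now = datetime(2004, 3, 9)
corpus = {
    "xkqjzvbwplmt": (1, datetime(2004, 1, 1)),   # long, seen once, old: removed
    "qzxwvkjhgfdsa": (1, datetime(2004, 3, 1)),  # long, seen once, recent: kept
    "mortgage": (57, datetime(2004, 1, 1)),      # short and frequent: kept
}
pruned = prune_long_rare_tokens(corpus, now)
print(sorted(pruned))  # ['mortgage', 'qzxwvkjhgfdsa']
```

The thresholds are keyword arguments so an admin could loosen or tighten them per run.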
LogSat, Admin Group. Joined: 25 January 2005. Location: United States. Status: Offline. Points: 4104
Alan,

Every 12 hours a built-in cleanup procedure prunes the corpus db.dat file of any stale entries. A "stale" token is a word (token) that has not appeared in an email in the past n days. The value of "n" is defined in the SpamFilter.ini file under:

;Remove any stale token in the corpus db.dat file that did not appear in incoming emails for the past n days

This helps keep the corpus to a manageable size.

Roberto F.
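For readers who want to picture the stale-token rule, here is a minimal Python sketch of the idea. It is not SpamFilter's actual code and does not touch db.dat: the `last_seen_by_token` mapping, the function name `prune_stale_tokens`, and the sample tokens are assumptions for illustration only.

```python
from datetime import datetime, timedelta

def prune_stale_tokens(last_seen_by_token, now, n_days):
    """Keep only tokens that appeared in an email within the past n_days.

    `last_seen_by_token` is a hypothetical mapping of
    token -> datetime of the most recent email containing it.
    """
    cutoff = now - timedelta(days=n_days)
    return {
        token: last_seen
        for token, last_seen in last_seen_by_token.items()
        if last_seen >= cutoff  # anything older than the cutoff is stale
    }

now = datetime(2004, 3, 9)
tokens = {
    "viagra": datetime(2004, 3, 8),        # seen yesterday: kept
    "zqxjvkwpmlh": datetime(2003, 12, 1),  # not seen for months: pruned
}
fresh = prune_stale_tokens(tokens, now, n_days=30)
print(sorted(fresh))  # ['viagra']
```

A one-off random string stops appearing in new mail almost immediately, so a rule like this removes it once n days have passed without needing to detect randomness.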