<?xml version="1.0" encoding="utf-8" ?>
<?xml-stylesheet type="text/xsl" href="RSS_xslt_style.asp" version="1.0" ?>
<rss version="2.0" xmlns:WebWizForums="http://syndication.webwiz.co.uk/rss_namespace/">
 <channel>
  <title>Spam Filter ISP Forums : Pruning the corpus?</title>
  <link>https://www.logsat.com/spamfilter/forums/</link>
  <description><![CDATA[This is an XML content feed of: Spam Filter ISP Forums : Spam Filter ISP Support : Pruning the corpus?]]></description>
  <pubDate>Sun, 08 Mar 2026 13:08:55 +0000</pubDate>
  <lastBuildDate>Wed, 10 Mar 2004 23:02:00 +0000</lastBuildDate>
  <docs>http://blogs.law.harvard.edu/tech/rss</docs>
  <generator>Web Wiz Forums 11.04</generator>
  <ttl>360</ttl>
  <WebWizForums:feedURL>https://www.logsat.com/spamfilter/forums/RSS_post_feed.asp?TID=3132</WebWizForums:feedURL>
  <image>
   <title><![CDATA[Spam Filter ISP Forums]]></title>
   <url>https://www.logsat.com/spamfilter/forums/forum_images/web_wiz_forums.png</url>
   <link>https://www.logsat.com/spamfilter/forums/</link>
  </image>
  <item>
   <title><![CDATA[Pruning the corpus? : Alan, Every 12 hours a built-in...]]></title>
   <link>https://www.logsat.com/spamfilter/forums/forum_posts.asp?TID=3132&amp;PID=3141&amp;title=pruning-the-corpus#3141</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="https://www.logsat.com/spamfilter/forums/member_profile.asp?PF=8">LogSat</a><br /><strong>Subject:</strong> 3132<br /><strong>Posted:</strong> 10 March 2004 at 11:02pm<br /><br /><P>Alan,</P><P>Every 12 hours a built-in cleanup procedure prunes stale entries from the corpus db.dat file. A "stale" token is a word that has not appeared in an incoming email in the past n days. The value of n is set in the SpamFilter.ini file:</P><P>;Remove any stale token in the corpus db.dat file that did not appear in incoming emails for the past n days<BR>CleanUpCorpusIntervalDays=7</P><P>This helps keep the corpus at a manageable size.</P><P>Roberto F.<BR>LogSat Software</P>]]>
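    <![CDATA[<P>A minimal sketch of the stale-token rule described above, not SpamFilter's actual implementation. The db.dat format is not documented in this thread, so the corpus is assumed here to be a plain dictionary mapping each token to the time it was last seen. The function name prune_stale_tokens and the sample tokens are hypothetical; interval_days mirrors CleanUpCorpusIntervalDays.</P>

```python
from datetime import datetime, timedelta


def prune_stale_tokens(last_seen, now, interval_days=7):
    """Return a copy of the corpus without stale tokens.

    last_seen maps token -> datetime of its most recent appearance
    (an assumed layout; the real db.dat format is not shown in the thread).
    A token is stale if it has not appeared within interval_days of now.
    """
    cutoff = now - timedelta(days=interval_days)
    return {tok: seen for tok, seen in last_seen.items() if seen >= cutoff}


# Hypothetical sample data for illustration only.
now = datetime(2004, 3, 10, 23, 0)
corpus = {
    "viagra": datetime(2004, 3, 10, 9, 0),        # seen today: kept
    "xkqzjwpvbnm": datetime(2004, 2, 20, 12, 0),  # last seen weeks ago: pruned
}
pruned = prune_stale_tokens(corpus, now)
print(sorted(pruned))
```

    <P>With a 7-day interval, a random string that showed up once in a spam run would drop out on the first cleanup pass after it has gone a week without reappearing.</P>]]>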
   </description>
   <pubDate>Wed, 10 Mar 2004 23:02:00 +0000</pubDate>
   <guid isPermaLink="true">https://www.logsat.com/spamfilter/forums/forum_posts.asp?TID=3132&amp;PID=3141&amp;title=pruning-the-corpus#3141</guid>
  </item> 
  <item>
   <title><![CDATA[Pruning the corpus? : Taking a look at the corpus dump...]]></title>
   <link>https://www.logsat.com/spamfilter/forums/forum_posts.asp?TID=3132&amp;PID=3132&amp;title=pruning-the-corpus#3132</link>
   <description>
    <![CDATA[<strong>Author:</strong> <a href="https://www.logsat.com/spamfilter/forums/member_profile.asp?PF=2">Guests</a><br /><strong>Subject:</strong> 3132<br /><strong>Posted:</strong> 09 March 2004 at 6:39pm<br /><br /><P>Taking a look at the corpus dump, it has quite a few entries made up of random character strings. These are likely the random strings that some spammers now put in their emails.</P><P>Would it be practical to come up with a routine to systematically root out and delete some of these random strings so they don't eventually bog down the corpus? Maybe just the entries with 11 or more random characters that have only one occurrence, and that one more than 30 days ago. It could be set up so that the admin can run it once in a while as part of their typical maintenance, like compacting a database.</P><P>I don't know if this is a practical idea, but it's just a thought and I figured it was worth a mention.</P>]]>
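    <![CDATA[<P>The routine suggested above could look roughly like the sketch below. It is only an illustration under stated assumptions: each corpus entry is assumed to carry an occurrence count and a last-seen time, which the thread does not confirm, and the function name find_prunable and the sample tokens are made up. It does not try to judge whether a string is really "random"; token length stands in for that.</P>

```python
from datetime import datetime, timedelta


def find_prunable(entries, now, min_len=11, max_age_days=30):
    """List tokens matching the criteria proposed in the post.

    entries maps token -> (count, last_seen); this layout is an assumption.
    A token is a candidate if it is at least min_len characters long,
    has occurred exactly once, and that occurrence is older than max_age_days.
    """
    cutoff = now - timedelta(days=max_age_days)
    return sorted(
        tok
        for tok, (count, last_seen) in entries.items()
        if len(tok) >= min_len and count == 1 and last_seen < cutoff
    )


# Hypothetical sample data for illustration only.
now = datetime(2004, 3, 9, 18, 39)
entries = {
    "xkqzjwpvbnmtr": (1, datetime(2004, 1, 15)),  # long, seen once, old
    "unsubscribe": (42, datetime(2004, 3, 9)),    # long but common
    "qzxv": (1, datetime(2004, 1, 15)),           # seen once, old, too short
    "mortgagerates": (1, datetime(2004, 3, 5)),   # seen once but recent
}
candidates = find_prunable(entries, now)
print(candidates)
```

    <P>Returning the candidates instead of deleting them directly would let an admin review the list before removing anything, in line with running this as an occasional maintenance task.</P>]]>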
   </description>
   <pubDate>Tue, 09 Mar 2004 18:39:00 +0000</pubDate>
   <guid isPermaLink="true">https://www.logsat.com/spamfilter/forums/forum_posts.asp?TID=3132&amp;PID=3132&amp;title=pruning-the-corpus#3132</guid>
  </item> 
 </channel>
</rss>