Spam Filter ISP Support Forum


Statistics: keyword

meatboy (Newbie)
Posted: 15 August 2006 at 2:21am

Hi,

As a suggestion to improve SpamFilter, would it be possible to add statistics on how often each keyword has been used to find spam? A count would show how effective each keyword or regex is.

I suspect the order in which the keywords are scanned means that keywords "higher up" in the list would tend to score more, but the information would at least show which keywords are of no use. The idea is to reduce the number of useless keywords.

Could this be implemented and would it be of any use?
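
To illustrate the ordering concern, here is a minimal sketch that assumes scanning stops at the first keyword that matches (that behaviour is the suspicion voiced above, not something confirmed for SpamFilter). The keyword list, the sample messages, and the helper names `first_match` and `tally` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical keyword list, scanned top to bottom.
KEYWORDS = [
    r"(?i)r0lex",
    r"(?i)replica",
    r"(?i)pre-approved",
    r"(?i)never-seen-token",
]

def first_match(message, patterns):
    """Return the first pattern that matches the message, or None."""
    for pattern in patterns:
        if re.search(pattern, message):
            return pattern
    return None

def tally(messages, patterns):
    """Count hits per pattern, crediting only the first match per message."""
    hits = Counter({p: 0 for p in patterns})
    for message in messages:
        matched = first_match(message, patterns)
        if matched is not None:
            hits[matched] += 1
    return hits

# Made-up sample messages; the first one matches two keywords.
messages = [
    "Cheap R0lex replica watches",
    "Best replica deals",
    "You are pre-approved",
    "Lunch at noon?",
]

hits = tally(messages, KEYWORDS)
# Keywords that never fired are candidates for removal.
unused = [p for p, n in hits.items() if n == 0]
```

Under this assumption the first message is credited only to the keyword that appears earlier in the list, so reordering the list changes the counts even though the mail is the same. Keywords with a count of zero are still a fair signal that they are not earning their place.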

Tim



Edited by meatboy

sgeorge (Senior Member)
Posted: 16 August 2006 at 11:40am

Hi meatboy. Actually, this is very possible if you are using a quarantine database and quarantine all messages that are blocked because of keyword matches. Here's some SQL that should give you a list of the keywords that sent messages to the quarantine, sorted by greatest number of occurrences per keyword:

SELECT rejectdetails, COUNT(*)
FROM tblQuarantine
WHERE rejectid = 13
GROUP BY rejectdetails
ORDER BY COUNT(*) DESC
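
If you want to try the query without touching a live quarantine, here is a rough sketch that runs it against an in-memory SQLite table from Python. Only the names `tblQuarantine`, `rejectid` and `rejectdetails` come from the query above; the two-column schema, the sample rows, the `hits` alias and the `keyword_hit_counts` helper are assumptions made for illustration, and the real table will have more columns.

```python
import sqlite3

# Same query as above, with an alias added for the count column.
QUERY = """
SELECT rejectdetails, COUNT(*) AS hits
FROM tblQuarantine
WHERE rejectid = 13
GROUP BY rejectdetails
ORDER BY hits DESC
"""

def keyword_hit_counts(conn):
    """Return (rejectdetails, hits) rows, most frequent first."""
    return conn.execute(QUERY).fetchall()

# Simplified stand-in for the quarantine table (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tblQuarantine (rejectid INTEGER, rejectdetails TEXT)"
)
conn.executemany(
    "INSERT INTO tblQuarantine VALUES (?, ?)",
    [
        (13, "r0lex"),       # made-up keyword rejections
        (13, "r0lex"),
        (13, "replica"),
        (7, "other rule"),   # different rejectid, filtered out by WHERE
    ],
)

rows = keyword_hit_counts(conn)
for details, hits in rows:
    print(f"{hits:>5}  {details}")
```

Keep in mind that this only counts what is currently stored in the quarantine table, so anything already purged will not show up.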


Stephen

LogSat (Admin Group)
Posted: 16 August 2006 at 4:13pm

Thanks, sgeorge. Excellent idea; we'll be using that ourselves!

Roberto Franceschetti

LogSat Software

Spam Filter ISP

sgeorge (Senior Member)
Posted: 16 August 2006 at 4:57pm

Oh my, well thanks!

Stephen

meatboy (Newbie)
Posted: 16 August 2006 at 7:15pm

Hi Sgeorge,

I have tried a quarantine DB, but only the Access one that LogSat provides. I like your idea though. Perhaps I can swap over to a SQL DB instead. Thanks for the idea!

Tim

Desperado (Senior Member)
Posted: 18 August 2006 at 9:09am

Hmmm ... I still think Sawmill works better for this, as it looks at ALL the blocked items rather than just the ones that are still in quarantine. See below (not sure how this will post):

 #  Messages   Share     Bytes  Keyword
 1     1,863  33.8 %   14.58 M  [((?i)<div><font face=arial size=2><img alt="" hspace=0)]
 2       912  16.6 %    1.77 M  [((?i)Subject:=\?ISO\-\d*\-\1?.*\?[a-z0-9]{20,})]
 3       874  15.9 %    1.90 M  [((?i)\<(font|span)[^>]+style[^>]+float[^>]*:[^>]*right)]
 4       434   7.9 %    7.57 M  [(con\$ign\-net)]
 5       319   5.8 %  639.00 k  [((?i)(((been|are) pre\-(approved|qualified))|(email is a commercial adv)))]
 6       261   4.7 %  677.00 k  Found prohibited attachment
 7       215   3.9 %  583.00 k  [((?i)<.?object)]
 8       148   2.7 %    1.04 M  Exceeded max spaces in subject
 9       110   2.0 %  181.00 k  [((?i)Subject:=\?utf\-\d*\?.*\?[a-z0-9]{20,})]
10       109   2.0 %  279.00 k  [((?i)<.?iframe)]
11        65   1.2 %  179.00 k  [Found Content-Transfer-Encoding=base64 and Content-Type=text/html/plain]
12        57   1.0 %  116.00 k  [((?i)((want watch)|(need watch)|(r0lex)|(bom\.evif)|(/replica/)|(z\.php)|(/r/sales)|(/rep/sales)|((fogw)|(eank)|(toels)...
13        37   0.7 %   77.00 k  [((?i)\<\!\[cdata\[)]
14        29   0.5 %    5.00 k  [((?i)Subject:(([\s]|[\!-\xB4]){0,10}[\|]){2})]
15        26   0.5 %   27.00 k  [((?i)style>(.){5,30}visibility: hidden;)]
16        25   0.5 %   45.00 k  [((?i)((ivacy is extremely import)|(this is not spam)|(not wanting to receiv)|(killers without prescrip)))]
17        11   0.2 %   13.00 k  [((?i)subject:.*(@.+@))]
18         7   0.1 %       0 b  [((?i)subject:.*((ڨ){3,}.*))]
19         3   0.1 %    3.00 k  [((?i)https://www\.surepayroll\.com)]
       5,505   100 %   29.61 M  Total
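
For anyone wanting to reproduce the percentage column from raw counts, or to flag rarely hit entries for pruning, here is a small sketch. The message counts are copied from the report above and keyed by row number; the `shares` helper and the cutoff of 10 messages are hypothetical choices, not anything Sawmill provides.

```python
# Message counts from the report above, keyed by row number.
COUNTS = {
    1: 1863, 2: 912, 3: 874, 4: 434, 5: 319, 6: 261, 7: 215,
    8: 148, 9: 110, 10: 109, 11: 65, 12: 57, 13: 37, 14: 29,
    15: 26, 16: 25, 17: 11, 18: 7, 19: 3,
}

def shares(counts):
    """Return each entry's share of the total, in percent, to one decimal."""
    total = sum(counts.values())
    return {key: round(100 * n / total, 1) for key, n in counts.items()}

total = sum(COUNTS.values())
pct = shares(COUNTS)

# Hypothetical cutoff: entries that blocked fewer than 10 messages.
MIN_MESSAGES = 10
rarely_hit = [row for row, n in COUNTS.items() if n < MIN_MESSAGES]

for row in rarely_hit:
    print(f"row {row}: {COUNTS[row]} messages ({pct[row]} %)")
```

A low count does not automatically make a keyword useless, since it may be the only rule catching a particular campaign, so treat the list as a starting point for review.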


The Desperado
Dan Seligmann.
Work: http://www.mags.net
Personal: http://www.desperado.com

sgeorge (Senior Member)
Posted: 18 August 2006 at 10:08am

meatboy, I use an Access DB as well, and that query does the trick for me. I believe the query should work unchanged in SQL and MySQL as well.

Desperado, good point. The only reason I don't use Sawmill is that my evaluation expired. Hey, glad to see that some of my keywords are in your top 5 list.

Stephen

Desperado (Senior Member)
Posted: 18 August 2006 at 10:24am

Stephen,

Top 1 even!

Thanks



Edited by Desperado
The Desperado
Dan Seligmann.
Work: http://www.mags.net
Personal: http://www.desperado.com
