I Hate Email Spam

By

Andrew Pitonyak



Spam, what can I say about it? I am too busy to be eloquent, which is the same reason that I am too busy to read spam. There is an excellent article by Paul Graham on how to use statistical analysis to identify spam. Gary Arnold has an excellent implementation of this method. Unfortunately, the initial implementation did not meet my needs, so I developed my own.
Of course, I no longer use my own spam filter now that I have moved to the Thunderbird email program, but this is how I used to filter spam using PMMail. I have to admit that I liked PMMail better than I like Thunderbird, but Thunderbird actually meets my needs and PMMail does not (it has seen no new development in a long time). Thunderbird has its own spam recognition methodology, and it seems to actually work. It does not work as well as my filters do, but... When I have time, perhaps I will filter my email using my own filters rather than those provided by Thunderbird.

Summary

The method at a glance is as follows:
  1. Collect a large number of good email messages
  2. Collect a large number of bad (spam) email messages
  3. Count the occurrences of each token (word) in the good email messages.
  4. Count the occurrences of each token (word) in the bad (spam) email messages.
  5. Determine the probability that a particular token would appear in a bad email message.
  6. For each email message of interest, use the probability that each token would appear in a good or bad email message to assign a probability that the message is bad.
I have about 400 bad email messages and a few thousand good email messages that I have received and collected. Experience indicates that if you filter your email messages with my probability file, the filtering will be poor. It is important that you collect a large number of your own good and bad messages.
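
Steps 3 through 5 can be sketched roughly as follows. This is an illustrative Python sketch, not the Perl scripts described on this page. The tokenizing pattern, the function names, and the weighting (good counts doubled, tokens seen fewer than five times ignored, probabilities clamped between 0.01 and 0.99) are assumptions modeled loosely on Paul Graham's article.

```python
import re
from collections import Counter

# Assumed token pattern: letters, digits, dollar signs, apostrophes, dashes.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def count_tokens(messages):
    """Count how often each token occurs across a list of message strings."""
    counts = Counter()
    for text in messages:
        counts.update(TOKEN_RE.findall(text.lower()))
    return counts


def token_probabilities(good, bad, n_good, n_bad, min_count=5):
    """Estimate, per token, the probability that a message containing it is bad.

    good and bad map token -> count; n_good and n_bad are the number of
    messages in each collection and must both be greater than zero.
    """
    probs = {}
    for token in set(good) | set(bad):
        g = 2 * good.get(token, 0)  # doubled to bias against false positives
        b = bad.get(token, 0)
        if g + b < min_count:
            continue  # too rare to say anything useful about this token
        good_ratio = min(1.0, g / n_good)
        bad_ratio = min(1.0, b / n_bad)
        p = bad_ratio / (good_ratio + bad_ratio)
        probs[token] = max(0.01, min(0.99, p))  # never fully certain
    return probs
```

A token seen only in bad messages ends up at 0.99, one seen only in good messages at 0.01, and one that is equally common in both sits near 0.5 and says little either way.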

PMMail

My desire was to produce something that would work with my email software of choice, PMMail. I do multiple things to filter the good email from the bad.

First, I created address books. I have address books for family members, people with whom I work, and even for my mailing lists. I also created folders to receive my filtered messages. In PMMail I added a complex filter similar to

h.fromid="$ab.Family" | h.fromid="$ab.Work" | h.fromid="$ab.Friends"

This acts as a white list, moving email from people in my address books to a Pass folder. I have similar filters for my mailing lists. Email that passes no filter I leave in my inbox.
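
The logic of that white list rule is an "or" across address books. The Python sketch below shows the idea only; it is not how PMMail evaluates its filters, and the function name, the dictionary layout, and the example addresses are hypothetical.

```python
from email.utils import parseaddr


def whitelist_folder(from_header, address_books):
    """Return "Pass" if the sender is in any address book, else None.

    address_books maps a book name to a set of lowercase email addresses.
    None means no white list rule matched, so the message stays in the inbox.
    """
    _, address = parseaddr(from_header)
    address = address.lower()
    for book in address_books.values():
        if address in book:
            return "Pass"
    return None


# Hypothetical address books, for illustration only.
books = {
    "Family": {"mom@example.org"},
    "Work": {"boss@example.com"},
}
print(whitelist_folder("Mom <Mom@Example.org>", books))
print(whitelist_folder("stranger@example.net", books))
```

Only messages for which no address book matches would go on to the statistical check.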

PMMail cannot call an external filter directly. It does, however, support a "user hook" for messages that pass a filter. I have a filter that searches the header for a space; it always passes and calls my batch file. My batch file does a statistical analysis of the message and then inserts a new header into the message if it thinks that it is spam. This only happens to messages that do not pass the white list filters.

My next filter searches headers for the text "X-Andy-Spam: Probably spam" and moves matching messages to the Spam folder. If a header is inserted for a message considered good, it will read "X-Andy-Spam: Probably not spam" and include the probability. Whether a header is inserted is configurable, as shown later.
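
The tagging and the follow-up header search can be illustrated with Python's standard email package. This is a sketch under assumptions, not the actual Perl script: the header name and the two verdict strings come from the text above, while the function names, the 0.90 cutoff default, and the way the probability is appended in parentheses are made up for the example.

```python
from email import message_from_string


def tag_message(raw_text, probability, spam_level=0.90):
    """Return the message text with an X-Andy-Spam header added."""
    msg = message_from_string(raw_text)
    if probability >= spam_level:
        verdict = "Probably spam"
    else:
        verdict = "Probably not spam"
    del msg["X-Andy-Spam"]  # drop a stale header; no error if it is absent
    # Assumed layout: verdict followed by the probability in parentheses.
    msg["X-Andy-Spam"] = f"{verdict} ({probability:.3f})"
    return msg.as_string()


def destination_folder(raw_text):
    """Mimic the second filter: tagged spam goes to Spam, the rest stays put."""
    msg = message_from_string(raw_text)
    if msg.get("X-Andy-Spam", "").startswith("Probably spam"):
        return "Spam"
    return "Inbox"
```

Because the good verdict reads "Probably not spam", a search for "X-Andy-Spam: Probably spam" cannot match it by accident.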

How Do I Do This

Install Perl

I wrote the filter in Perl. There are many Perl implementations available for Windows; I use a version that is freely available from ActiveState. If you follow the link to ActiveState, you can click on the button in the upper left corner that says "Download." ActiveState requires you to register before downloading (I am told that you do not have to supply any information). Choose what you desire, download it, and install it.

Install Required Modules

My code uses an extra module. Perl can automatically install it for you.
  1. Open a command prompt
  2. Go to the perl bin directory. This is probably c:\perl\bin.
  3. Type ppm.bat to run the Perl Package Manager
  4. Type help if you want help
  5. Type install MIME-tools
  6. Type quit

Copy The Code

My code uses my Perl utility packages. These packages must be in a directory called Pitonyak. When Perl searches for packages, it looks in its own lib directory, probably c:\perl\lib. If you do not place the Pitonyak directory there, then you must tell Perl where it is located by setting the perllib environment variable. If you create the directory "C:\spam\Pitonyak", then set "perllib=C:\spam".

You may place the spam scripts wherever you desire.

Configure The Scripts

Modify pmmail_build_tokens.bat

pmmail_build_tokens.bat builds the good, bad, and probability token files. It is hard coded to build my token files from my email messages. You must modify the batch file to build your own token files from your own email messages. The batch file is very short.

  1. del q:\devsrc\Perl\spam\andy_*.dat
    Delete the token files. You may want to change the paths and file names.
  2. perl -w q:\devsrc\Perl\spam\tokenize_file.pl -r -o q:\devsrc\Perl\spam\andy_good.dat -s c:\pmmail\andyp_0.act\Known1.FLD\*.msg -s c:\pmmail\andyp_0.act\pass0.FLD\*.msg -s c:\pmmail\andyp_0.act\PERSON0.FLD\*.msg --log_cfg q:\devsrc\Perl\spam\logger.dat
    The meaning of each parameter is summarized here. This creates the andy_good.dat token file from my good email messages. Be certain to modify the directory locations and file names as appropriate. Add a -s entry for each additional set of email message files to search.
  3. perl -w q:\devsrc\Perl\spam\tokenize_file.pl -r -o q:\devsrc\Perl\spam\andy_bad.dat -s c:\pmmail\andyp_0.act\SPAM0.FLD\Verif0.FLD\*.msg --log_cfg q:\devsrc\Perl\spam\logger.dat
    This creates the bad token file using the same methods as in the previous step.
  4. perl -w q:\devsrc\Perl\spam\build_probabilities.pl -b q:\devsrc\Perl\spam\andy_bad.dat -g q:\devsrc\Perl\spam\andy_good.dat -p q:\devsrc\Perl\spam\andy_prob.dat --log_cfg q:\devsrc\Perl\spam\logger.dat
    The meaning of each parameter is summarized here. This builds the probability file containing the likelihood that a token is good or bad.

Configure pmmail_spam.bat

PMMail, using the external user hook, calls this batch file once for each message. I can also run the Perl scripts manually and check a few thousand messages at a time.

This batch file is very simple, consisting of one line.
perl -w q:\devsrc\Perl\spam\spam_check_file.pl -sl 0.90 -a 0.5 -p q:\devsrc\Perl\spam\andy_prob.dat --log_cfg q:\devsrc\Perl\spam\logger.dat -s %1
The meaning of each parameter is summarized here. Besides the obvious changes related to the directory structure and file names, you may want to change -sl 0.90 to -sl 0.99. This reduces the likelihood of a false positive. You may also want to change -a 0.5 to -a 0.99 so that the header will only be introduced if the email is considered spam.
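
To make the two thresholds concrete, here is a rough Python sketch of step 6 of the method and of one plausible reading of -sl and -a. It is not spam_check_file.pl. The combining rule, the 0.4 value for unknown tokens, and the limit of 15 tokens follow Paul Graham's article; the function names are hypothetical, and treating -a as the minimum probability at which any header is added is an assumption drawn from the description above.

```python
from math import prod


def spam_probability(tokens, probs, unknown=0.4, top_n=15):
    """Combine per-token probabilities into one probability for the message.

    probs maps token -> probability and is assumed to hold values strictly
    between 0 and 1 (for example clamped to the range 0.01 to 0.99).
    """
    values = [probs.get(t, unknown) for t in set(tokens)]
    # Keep the most "interesting" tokens: those farthest from a neutral 0.5.
    values.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = values[:top_n]
    spam = prod(top)
    ham = prod(1.0 - p for p in top)
    return spam / (spam + ham)


def verdict(probability, spam_level=0.90, add_level=0.5):
    """Return the header text to add, or None when no header is added.

    spam_level stands in for -sl and add_level for -a (assumed meanings).
    """
    if probability < add_level:
        return None
    if probability >= spam_level:
        return "Probably spam"
    return "Probably not spam"
```

Under this reading, the defaults on the command line above tag a message scoring 0.7 as "Probably not spam", while raising both values to 0.99 leaves everything below 0.99 untagged.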

Configure logger.dat

This is the file that the batch files use to configure the output of the scripts. You can check the meaning of each parameter here. When a message is logged, it is given a type. This type is used to decide where the message should be logged. The primary keys of interest are screen_output and file_output. The primary output types are as follows:

Type  Type Meaning  SmallLogger Method
W     Warning       warn($message)
I     Info          info($message)
E     Error         error($message)
T     Trace         trace($message)
D     Debug         debug($message)
F2                  write_log_type('F2', $message);
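
The idea of routing each message by its type can be pictured with the small Python sketch below. It illustrates the concept only and is not the SmallLogger package or the logger.dat format: the class, its constructor arguments, and the "TYPE: message" line layout are hypothetical, and it assumes that screen_output and file_output each select which types reach that destination.

```python
import io


class TypedLogger:
    """Send each message to the screen, the file, both, or neither by type."""

    def __init__(self, screen_types, file_types, screen, log_file):
        self.screen_types = set(screen_types)  # assumed role of screen_output
        self.file_types = set(file_types)      # assumed role of file_output
        self.screen = screen
        self.log_file = log_file

    def write_log_type(self, log_type, message):
        line = f"{log_type}: {message}\n"
        if log_type in self.screen_types:
            self.screen.write(line)
        if log_type in self.file_types:
            self.log_file.write(line)

    # Convenience methods named after the types in the table above.
    def warn(self, message):
        self.write_log_type("W", message)

    def info(self, message):
        self.write_log_type("I", message)

    def error(self, message):
        self.write_log_type("E", message)

    def trace(self, message):
        self.write_log_type("T", message)

    def debug(self, message):
        self.write_log_type("D", message)


# In-memory streams stand in for the console and the log file.
screen, log_file = io.StringIO(), io.StringIO()
log = TypedLogger({"E", "W"}, {"E", "W", "I", "F2"}, screen, log_file)
log.error("cannot open probability file")
log.info("checked 1 message")
log.debug("this type is enabled for neither destination")
```

With these settings the error goes to both destinations, the info line goes only to the file, and the debug line is dropped.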

What does a large business do?

A dedicated business email hosting service provides some built-in spam filtering, which helps keep your business email from being deluged with unwanted spam.

Last Modified July 5, 2010 01:59:46 PM UTC
© 1999-2024 Andrew Pitonyak (email me at: andy @ pitonyak.org)