CRM114

Description

Implementation of the CRM114 Discriminator. CRM114 describes itself as a programmable, fast-learning data examiner for various purposes. It can easily be trained to classify mail as SPAM or HAM.

Configuration

Please read first:

default_user

Allowed values: String (path to crm114 directory)
Required: no
Default: -

Can be set to the global crm114 directory, which is probably /etc/crm114/ or /var/spool/crm114/ (see below).

In the CRM114 context, the user is a directory in which the CRM114 filter files (*.mfp, *.css, the reaver_cache directory and so on) reside.

weight_translate

Translates CRM114 scores into weights.

Example:

CRM114 scores lower than -10 will be translated to -100, scores between -3 and -10 will be translated to -50, and so on.

weight_translate:
     5: 20
     1: 10
     0: 0
     -2: 0
     -3: -50
     -10: -100

cmd_check

Default: "/usr/share/crm114/mailreaver.crm --fileprefix=%user% -u %user% --report_only"
Allowed values: String (path to the crm114 script and cmd args)
Required: yes

The command line for the CRM114 check command, including all command line arguments. The variables %user% (the user) and %file% (path to the temporary mail file) can be used.
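As a rough illustration of how the %user% and %file% placeholders could be expanded, here is a minimal Python sketch. The helper name expand_command, the use of shlex, and the path /tmp/mail.eml are assumptions made for illustration; this is not the module's actual implementation.

```python
import shlex


def expand_command(template, user, mail_file):
    """Replace %user% and %file% in a command template and split it into argv.

    Hypothetical helper for illustration only; not the module's own code.
    """
    replacements = {"%user%": user, "%file%": mail_file}
    argv = []
    # Split first, then substitute, so a value containing spaces
    # still ends up as a single argument.
    for token in shlex.split(template):
        for placeholder, value in replacements.items():
            token = token.replace(placeholder, value)
        argv.append(token)
    return argv


template = (
    "/usr/share/crm114/mailreaver.crm "
    "--fileprefix=%user% -u %user% --report_only"
)
argv = expand_command(template, "/var/spool/crm114/", "/tmp/mail.eml")
print(argv)
```

Splitting the template before substituting keeps each placeholder value in one argument, which avoids quoting problems when a path contains spaces.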

cmd_learn_spam, cmd_unlearn_ham

Default: "/usr/share/crm114/mailfilter.crm --fileprefix=%user% -u %user% --learnspam"
Allowed values: String (path to the crm114 script and cmd args)
Required: yes

Command line used for training new SPAM / unlearning HAM mails.

cmd_learn_ham, cmd_unlearn_spam

Default: "/usr/share/crm114/mailfilter.crm --fileprefix=%user% -u %user% --learngood"
Allowed values: String (path to the crm114 script and cmd args)
Required: yes

Command line used for training new HAM / unlearning SPAM mails.
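The pairing of the four training commands can be summarized in a short Python sketch: learning a mail as SPAM and unlearning it as HAM share one command line, and the same holds for the opposite direction. The names TRAIN_COMMANDS and training_command are hypothetical and only serve the illustration.

```python
MAILFILTER = "/usr/share/crm114/mailfilter.crm --fileprefix=%user% -u %user%"

# Learning as SPAM and unlearning as HAM use the same flag, and vice versa.
TRAIN_COMMANDS = {
    "cmd_learn_spam": MAILFILTER + " --learnspam",
    "cmd_unlearn_ham": MAILFILTER + " --learnspam",
    "cmd_learn_ham": MAILFILTER + " --learngood",
    "cmd_unlearn_spam": MAILFILTER + " --learngood",
}


def training_command(action, user):
    """Return the command line for a training action (hypothetical helper)."""
    return TRAIN_COMMANDS["cmd_" + action].replace("%user%", user)


print(training_command("unlearn_ham", "/var/spool/crm114/"))
```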

Example

---

disable: 0

default_user: /var/spool/crm114/

# > 5: 20
# 1 -> 5: 20
# 0 -> 1: 10
# -2 -> 0: 0
# -3 -> -2: -50
# -10 -> -3: -100
# <-10: -100
weight_translate:
    5: 20
    1: 10
    0: 0
    -2: 0
    -3: -50
    -10: -100

# cmd_check: '/usr/share/crm114/mailreaver.crm --fileprefix=%user% -u %user% --report_only'
# cmd_learn_spam: '/usr/share/crm114/mailfilter.crm --fileprefix=%user% -u %user% --learnspam'
# cmd_unlearn_spam: '/usr/share/crm114/mailfilter.crm --fileprefix=%user% -u %user% --learngood'
# cmd_learn_ham: '/usr/share/crm114/mailfilter.crm --fileprefix=%user% -u %user% --learngood'
# cmd_unlearn_ham: '/usr/share/crm114/mailfilter.crm --fileprefix=%user% -u %user% --learnspam'

CRM114 hints

This is a very simplified installation guide. For more in-depth information, search the web. I assume you want a global CSS database rather than a per-user one.

Install CRM114

First of all, get crm114. On a Debian system that would be:

aptitude install crm114

For anybody else: go to the download page and follow the instructions.

Setup

First you need a directory that will contain the configuration files and the "reaver_cache", which can grow as large as all the emails you have fed in combined.

mkdir /var/spool/crm114
cd /var/spool/crm114

Now copy the basic configuration file from your crm114 base installation into your directory.

cp /usr/share/crm114/mailfilter.cf .

Create empty required files

touch rewrites.mfp priolist.mfp whitelist.mfp blacklist.mfp

Create empty css files for SPAM and HAM

cssutil -b -r spam.css
cssutil -b -r nonspam.css

Adjust mailfilter.cf to your needs. In particular, have a look at the following keys:

:spw: /mypassword/
:add_verbose_stats: /no/
:add_extra_stuff: /no/
:rewrites_enabled: /no/
:spam_flag_subject_string: //
:unsure_flag_subject_string: //
:log_to_allmail.txt: /no/

Also be aware of the thresholds if you use the weight_spam and weight_ham directives instead of weight_translate:

:good_threshold: /10.0/
:spam_threshold: /-5.0/
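To show what the two thresholds mean, here is a minimal Python sketch that sorts a score into three bands. It assumes that a score at or above good_threshold counts as good, a score at or below spam_threshold counts as spam, and everything in between is "unsure". The exact boundary handling in CRM114 may differ, and the function name classify is hypothetical.

```python
GOOD_THRESHOLD = 10.0   # :good_threshold: from mailfilter.cf
SPAM_THRESHOLD = -5.0   # :spam_threshold: from mailfilter.cf


def classify(score, good=GOOD_THRESHOLD, spam=SPAM_THRESHOLD):
    """Sort a score into good / spam / unsure (boundaries are an assumption)."""
    if score >= good:
        return "good"
    if score <= spam:
        return "spam"
    return "unsure"


for score in (12.3, 0.0, -7.0):
    print(score, classify(score))
```

Moving the two thresholds closer together shrinks the unsure band, and moving them apart widens it.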

Remember to adjust the directory ownership to your Decency user

chown mailuser:mailgroup -R /var/spool/crm114

Initial training

Train crm114 with your first mails. You need one directory containing SPAM and one containing HAM; then this will work:

/usr/share/crm114/mailtrainer.crm --spam=/path/to/spamdir --good=/path/to/hamdir \
    --fileprefix=/var/spool/crm114/

That's all.

Performance

CRM114 has to analyze the whole mail, which can take up to several seconds but should stay under two seconds most of the time. This depends on your SPAM/HAM dataset and the size of the mail.
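To see how long a check takes on your own data, you could time the check command and cap it with a timeout. The following Python sketch assumes that the mail is passed on standard input and uses an arbitrary 10 second limit. The function name timed_check is hypothetical, and the stand-in echo command only exists so that the sketch runs without CRM114 installed.

```python
import subprocess
import sys
import time


def timed_check(argv, mail_bytes, timeout=10.0):
    """Run a command with the mail on stdin and measure how long it takes.

    Raises subprocess.TimeoutExpired if the command exceeds `timeout` seconds.
    """
    start = time.perf_counter()
    result = subprocess.run(
        argv,
        input=mail_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
    )
    elapsed = time.perf_counter() - start
    return result.returncode, result.stdout, elapsed


# Stand-in command that echoes stdin; replace it with the real check command.
echo = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]
code, output, seconds = timed_check(echo, b"Subject: test\n\nhello\n")
print(code, round(seconds, 3))
```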