Bogofilter
Description
Bogofilter is a statistical (Bayesian) mail filter. It has to be trained with SPAM and HAM mails before use and can later be tuned by retraining false positives as HAM and false negatives as SPAM.
Configuration
Please read first:
- default configuration
  - disable
  - max_size
- anti-SPAM module configuration
  - weight_innocent
  - weight_spam
  - weight_translate
- user configuration
  - default_user
  - user_cmd
default_user
Default: -
Allowed values: string (path)
Required: no
Path to a global bogofilter.cf file. Set this if a shared SPAM database should be used. Typically this is /etc/bogofilter.cf.
cmd_check
Default: "/usr/bin/bogofilter -c %user% -U -I %file% -v"
Allowed values: string (path to bogofilter and cmd args)
Required: yes
The command line for the bogofilter check, including all command line arguments. The variables %user% (the user) and %file% (path to the temporary mail file) can be used.
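The placeholder substitution can be pictured with a small shell sketch. The helper name expand_cmd and the temporary file path are hypothetical, and this is not how Decency implements the substitution internally; it only illustrates how %user% and %file% end up in the final command line.

```shell
#!/bin/sh
# expand_cmd TEMPLATE USER FILE
# Hypothetical helper: replaces %user% and %file% in a command template.
# '|' is used as the sed delimiter, so the values must not contain '|', '&' or '\'.
expand_cmd() {
    printf '%s\n' "$1" | sed -e "s|%user%|$2|g" -e "s|%file%|$3|g"
}

# Expand the default check command for a shared config file
# and an illustrative temporary mail file.
expand_cmd '/usr/bin/bogofilter -c %user% -U -I %file% -v' \
    /etc/bogofilter.cf /tmp/mail-1234.eml
```

With default_user set to /etc/bogofilter.cf, the expanded command reads the shared configuration via -c and the mail to check via -I.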
cmd_learn_spam
Default: "/usr/bin/bogofilter -c %user% -s -I %file%"
Allowed values: string (path to bogofilter and cmd args)
Required: yes
Command line for learning SPAM with bogofilter.
cmd_unlearn_spam
Default: "/usr/bin/bogofilter -c %user% -N -I %file%"
Allowed values: string (path to bogofilter and cmd args)
Required: yes
Command line for unlearning SPAM, for mails which have been marked as SPAM beforehand.
cmd_learn_ham
Default: "/usr/bin/bogofilter -c %user% -n -I %file%"
Allowed values: string (path to bogofilter and cmd args)
Required: yes
Command line for learning new HAM.
cmd_unlearn_ham
Default: "/usr/bin/bogofilter -c %user% -S -I %file%"
Allowed values: string (path to bogofilter and cmd args)
Required: yes
Command line for unlearning a mail which has been falsely recognized as SPAM.
apply_spamicity
Default: 0
Allowed values: Bool
Whether bogofilter's spamicity value should be factored into the result score. Bogofilter returns a spamicity between 0 and 1, and the score is multiplied by it: if you score SPAM with -50 and bogofilter's spamicity is 0.5, the effective score will be -25.
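The scaling can be reproduced with a short shell sketch. The function name scale_score is hypothetical and the code only mirrors the arithmetic described above (score multiplied by spamicity); it is not taken from Decency itself.

```shell
#!/bin/sh
# scale_score SCORE SPAMICITY
# Multiplies a configured score by bogofilter's spamicity (a value from 0 to 1).
scale_score() {
    awk -v score="$1" -v spamicity="$2" \
        'BEGIN { printf "%g\n", score * spamicity }'
}

scale_score -50 0.5   # prints -25, the example from the text
scale_score -50 1     # prints -50, the full score for a spamicity of 1
```

With apply_spamicity set to 0, the configured score would be applied unscaled.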
Example
---
disable: 0
apply_spamicity: 0
cmd_check: '/usr/bin/bogofilter -c %user% -U -I %file% -v'
cmd_learn_spam: '/usr/bin/bogofilter -c %user% -s -I %file%'
cmd_unlearn_spam: '/usr/bin/bogofilter -c %user% -N -I %file%'
cmd_learn_ham: '/usr/bin/bogofilter -c %user% -n -I %file%'
cmd_unlearn_ham: '/usr/bin/bogofilter -c %user% -S -I %file%'
default_user: '/etc/bogofilter.cf'
Bogofilter hints
This section is not an in-depth guide to configuring or running bogofilter; it only collects a few hints that might come in handy. There is no warranty that this is the best or even the correct way to do it.
Global SPAM directory
If you want to use one SPAM database rather than one per (unix) user, you can set the bogofilter_dir in /etc/bogofilter.cf:
bogofilter_dir = /var/spool/bogofilter
In the bogofilter configuration in Decency you should then set "default_user" to the global config file:
default_user: '/etc/bogofilter.cf'
Initial bogofilter training
Like all statistical filters, bogofilter has to be trained before it can be put to use. If you have a large SPAM collection of, let's say, at least 10,000 mails, use it (HAM you probably already have: your inbox). If you don't, you can download an initial SPAM corpus, search for one online, or collect your own via the HoneyPot / HoneyCollector modules.
Assuming you have your SPAM and HAM files in two directories as eml files, you can train bogofilter like this:
cd spam-ham
find ham/ -type f -exec bogofilter --user-config /etc/bogofilter.cf -n -I {} \;
find spam/ -type f -exec bogofilter --user-config /etc/bogofilter.cf -s -I {} \;
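Before training on a directory-based corpus like the one above, it can be useful to check how many mails each directory actually holds. The following is a minimal sketch, assuming one mail per file; the helper name count_mails is hypothetical.

```shell
#!/bin/sh
# count_mails DIR
# Prints the number of regular files below DIR (one mail per file is assumed).
count_mails() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Report the corpus sizes if the directories from the example exist.
for dir in ham spam; do
    if [ -d "$dir" ]; then
        echo "$dir: $(count_mails "$dir") mails"
    fi
done
```

Run it from inside the spam-ham directory; a very small count for either class suggests collecting more mail before relying on the filter.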
Or if you have mbox files:
cd spam-ham
bogofilter --user-config /etc/bogofilter.cf -n < ham.mbox
bogofilter --user-config /etc/bogofilter.cf -s < spam.mbox
More detailed information can be found in the Bogofilter FAQ.
Performance
Bogofilter has to analyze the whole mail, which can take up to several seconds but should mostly stay under one second. The actual time depends on your SPAM/HAM dataset and the size of the mail.