Auto learn spam/ham with Dovecot imap_sieve plugin

Summary

Dovecot offers plugin imap_sieve to run sieve script for spam/virus scanning, it's useful to let end users report spam/ham messages within webmail or MUA, then on server side we call SpamAssassin to learn the reported messages. The more spams/hams end users reported, the more precisely SpamAssassin can catch the spams.

This tutorial shows you how to enable Dovecot plugin imap_sieve and create required shell/sieve scripts to learn spams automatically.

After setup, you can encourage end users to report spam messages by moving/dragging spam to Junk folder. With more spams reported, your iRedMail server can precisely catch more spams.

Requirements

Enable imap_sieve plugin

Please update Dovecot config file /etc/dovecot/dovecot.conf to:

# Store METADATA information within user's HOME directory
mail_attribute_dict = file:%h/dovecot-attributes

protocol imap {
    ...
    mail_plugins = ... imap_sieve
}

plugin {
    sieve_plugins = sieve_imapsieve sieve_extprograms
    imapsieve_url = sieve://127.0.0.1:4190

    # From elsewhere to Junk folder
    imapsieve_mailbox1_name = Junk
    imapsieve_mailbox1_causes = COPY APPEND
    imapsieve_mailbox1_before = file:/var/vmail/sieve/report_spam.sieve

    # From Junk folder to elsewhere
    imapsieve_mailbox2_name = *
    imapsieve_mailbox2_from = Junk
    imapsieve_mailbox2_causes = COPY
    imapsieve_mailbox2_before = file:/var/vmail/sieve/report_ham.sieve

    sieve_pipe_bin_dir = /etc/dovecot/sieve/pipe

    sieve_global_extensions = +vnd.dovecot.pipe +vnd.dovecot.environment

}

Create required directories and files

We will create few directories and files used by imap_sieve plugin:

Create directories:

mkdir -p /etc/dovecot/sieve/pipe
mkdir -p /var/vmail/imapsieve_copy
chown vmail:vmail /var/vmail/imapsieve_copy
chmod 0700 /var/vmail/imapsieve_copy

Create file /var/vmail/sieve/report_spam.sieve with content below:

require ["vnd.dovecot.pipe", "copy", "imapsieve", "environment", "variables"];

if environment :matches "imap.user" "*" {
    set "username" "${1}";
}

pipe :copy "imapsieve_copy" [ "${username}", "spam" ];

Create file /var/vmail/sieve/report_ham.sieve with content below:

require ["vnd.dovecot.pipe", "copy", "imapsieve", "environment", "variables"];

if environment :matches "imap.mailbox" "*" {
    set "mailbox" "${1}";
}

if string "${mailbox}" "Trash" {
    stop;
}

if environment :matches "imap.user" "*" {
    set "username" "${1}";
}

pipe :copy "imapsieve_copy" [ "${username}", "ham" ];

Create file /etc/dovecot/sieve/pipe/imapsieve_copy with content below:

#!/usr/bin/env bash
# Author: Zhang Huangbin <zhb@iredmail.org>
# Purpose: Read full email message from stdin, and save to a local file.

# Usage: bash imapsieve_copy <email> <spam|ham> <output_base_dir>

export USER="$1"
export MSG_TYPE="$2"

export OUTPUT_BASE_DIR="/var/vmail/imapsieve_copy"
export OUTPUT_DIR="${OUTPUT_BASE_DIR}/${MSG_TYPE}"
export FILE="${OUTPUT_DIR}/${USER}-$(date +%Y%m%d%H%M%S)-${RANDOM}${RANDOM}.eml"

export OWNER="vmail"
export GROUP="vmail"

for dir in "${OUTPUT_BASE_DIR}" "${OUTPUT_DIR}"; do
    if [[ ! -d ${dir} ]]; then
        mkdir -p ${dir}
        chown ${OWNER}:${GROUP} ${dir}
        chmod 0700 ${dir}
    fi
done

cat > ${FILE} < /dev/stdin

# Logging
#export LOG='logger -p local5.info -t imapsieve_copy'
#[[ $? == 0 ]] && ${LOG} "Copied one ${MSG_TYPE} email reported by ${USER}: ${FILE}"

Set correct file owner and permissions:

chown vmail:vmail /var/vmail/sieve/report_spam.sieve \
    /var/vmail/sieve/report_ham.sieve \
    /etc/dovecot/sieve/pipe/imapsieve_copy

chmod 0700 /var/vmail/sieve/report_spam.sieve \
    /var/vmail/sieve/report_ham.sieve \
    /etc/dovecot/sieve/pipe/imapsieve_copy

Restart Dovecot service to enable this plugin.

service dovecot restart

Setup cron job to scan and learn spam/ham messages

Dovecot can now save a copy of reported spam/ham automatically, we still need a shell script to call SpamAssassin to actually learn spam/ham periodly.

Create script /etc/dovecot/sieve/scan_reported_mails.sh with content below, it's used to call sa-learn command to learn reported spam/ham emails:

Attention

If you're running FreeBSD or OpenBSD, please change the Amavisd daemon user name in variable AMAVISD_USER below.

#!/usr/bin/env bash
# Author: Zhang Huangbin <zhb@iredmail.org>
# Purpose: Copy spam/ham to another directory and call sa-learn to learn.

# Paths to find program.
export PATH="/bin:/usr/bin:/usr/local/bin:$PATH"

export OWNER="vmail"
export GROUP="vmail"

# The Amavisd daemon user.
# Note: on OpenBSD, it's "_vscan". On FreeBSD, it's "vscan".
export AMAVISD_USER='amavis'

# Kernel name, in upper cases.
export KERNEL_NAME="$(uname -s | tr '[a-z]' '[A-Z]')"

# A temporary lock file. should be removed after successfully examed messages.
export LOCK_FILE='/tmp/scan_reported_mails.lock'

# Logging to syslog with 'logger' command.
export LOG='logger -p local5.info -t scan_reported_mails'

# `sa-learn` command, with optional arguments.
export SA_LEARN="sa-learn -u ${AMAVISD_USER}"

# Spool directory.
# Must be owned by vmail:vmail.
export SPOOL_DIR='/var/vmail/imapsieve_copy'

# Directories which store spam and ham emails.
# These 2 should be created while setup Dovecot antispam plugin.
export SPOOL_SPAM_DIR="${SPOOL_DIR}/spam"
export SPOOL_HAM_DIR="${SPOOL_DIR}/ham"

# Directory used to store emails we're going to process.
# We will copy new spam/ham messages to these directories, scan them, then
# remove them.
export SPOOL_LEARN_SPAM_DIR="${SPOOL_DIR}/processing/spam"
export SPOOL_LEARN_HAM_DIR="${SPOOL_DIR}/processing/ham"

if [ -e ${LOCK_FILE} ]; then
    find $(dirname ${LOCK_FILE}) -maxdepth 1 -ctime 1 "$(basename ${LOCK_FILE})" >/dev/null 2>&1
    if [ X"$?" == X'0' ]; then
        rm -f ${LOCK_FILE} >/dev/null 2>&1
    else
        ${LOG} "Lock file exists (${LOCK_FILE}), abort."
        exit
    fi
fi

for dir in "${SPOOL_DIR}" "${SPOOL_LEARN_SPAM_DIR}" "${SPOOL_LEARN_HAM_DIR}"; do
    if [[ ! -d ${dir} ]]; then
        mkdir -p ${dir}
    fi

    chown ${OWNER}:${GROUP} ${dir}
    chmod 0700 ${dir}
done

# If there're a lot files, direct `mv` command may fail with error like
# `argument list too long`, so we need `find` in this case.
if [[ X"${KERNEL_NAME}" == X'OPENBSD' ]]; then
    [[ -d ${SPOOL_SPAM_DIR} ]] && find ${SPOOL_SPAM_DIR} -name '*.eml' -exec mv {} ${SPOOL_LEARN_SPAM_DIR}/ \;
    [[ -d ${SPOOL_HAM_DIR} ]]  && find ${SPOOL_HAM_DIR}  -name '*.eml' -exec mv {} ${SPOOL_LEARN_HAM_DIR}/  \;
else
    [[ -d ${SPOOL_SPAM_DIR} ]] && find ${SPOOL_SPAM_DIR} -name '*.eml' -exec mv -t ${SPOOL_LEARN_SPAM_DIR}/ {} +
    [[ -d ${SPOOL_HAM_DIR} ]]  && find ${SPOOL_HAM_DIR}  -name '*.eml' -exec mv -t ${SPOOL_LEARN_HAM_DIR}/  {} +
fi

# Try to delete empty directory, if failed, that means we have some messages to
# scan.
rmdir ${SPOOL_LEARN_SPAM_DIR} &>/dev/null
if [[ X"$?" != X'0' ]]; then
    output="$(${SA_LEARN} --spam ${SPOOL_LEARN_SPAM_DIR})"
    rm -rf ${SPOOL_LEARN_SPAM_DIR} &>/dev/null
    ${LOG} '[SPAM]' ${output}
fi

rmdir ${SPOOL_LEARN_HAM_DIR} &>/dev/null
if [[ X"$?" != X'0' ]]; then
    output="$(${SA_LEARN} --ham ${SPOOL_LEARN_HAM_DIR})"
    rm -rf ${SPOOL_LEARN_HAM_DIR} &>/dev/null
    ${LOG} '[CLEAN]' ${output}
fi

rm -f ${LOCK_FILE} &>/dev/null

Run command crontab -e -u root to setup cron job for root user, scan emails every 10 minutes:

# iRedMail: Scan reported mails.
*/10   *   *   *   *   /bin/bash /etc/dovecot/sieve/scan_reported_mails.sh

Tests

Report spam: Move email from Inbox to Junk

In Dovecot log file /var/log/dovecot/imap.log (or dovecot.log if you didn't configure syslog daemon to separate log content), you should see log lines like below:

Jan 31 21:10:42 c7 dovecot: imap(<email>): sieve: pipe action: piped message to program `imapsieve_copy'
Jan 31 21:10:42 c7 dovecot: imap(<email>): sieve: left message in mailbox 'Junk'
Jan 31 21:10:42 c7 dovecot: imap(<email>): expunge: box=INBOX, uid=7, msgid=, size=7805, from=<email>, subject=<subject>

In the meantime, you should see an email in /var/vmail/imapsieve_copy/spam/, file name in <email>-<timestamp>-<random_number>.eml format.

Report ham: Move email from Junk to any other folder (except Trash)

If you found a clean email in Junk folder, just move it from Junk to any other folder except Trash.

In Dovecot log file /var/log/dovecot/imap.log (or dovecot.log), you should see log lines like below:

Jan 31 21:15:51 c7 dovecot: imap(<email>): sieve: pipe action: piped message to program `imapsieve_copy'
Jan 31 21:15:51 c7 dovecot: imap(<email>): sieve: left message in mailbox 'INBOX'
Jan 31 21:15:51 c7 dovecot: imap(<email>): expunge: box=Junk, uid=7, msgid=, size=7805, from=<email>, subject=<subject>

In the meantime, you should see an email in /var/vmail/imapsieve_copy/ham/, file name in <email>-<timestamp>-<random_number>.eml format.

Scan reported mails

It's ok to run the script manually to scan reported mails:

bash /etc/dovecot/sieve/scan_reported_mails.sh

If it scanned messages, it will log a message in /var/log/syslog or /var/log/messages like this:

Jan 31 04:51:34 mail scan_reported_mails: [CLEAN] Learned tokens from 1 message(s) (1 message(s) examined)
Jan 31 05:03:16 mail scan_reported_mails: [SPAM] Learned tokens from 1 message(s) (1 message(s) examined)

Check detailed bayes learning log on command line

You can either turn on debug mode in Amavisd and SpamAssassin to check how bayes learning works in SpamAssassin, or run sa-learn manually to check it with a sample email.

To check on command line, please upload/save a sample email to /opt/sample.eml, then run sa-learn as root user:

# su -s /bin/bash amavis -c "spamassassin -D bayes < /opt/sample.eml"
May 21 05:27:08.244 [32241] dbg: bayes: learner_new self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x2fe8cb8), bayes_store_module=Mail::SpamAssassin::BayesStore::MySQL
May 21 05:27:08.264 [32241] dbg: bayes: using username: amavis
May 21 05:27:08.264 [32241] dbg: bayes: learner_new: got store=Mail::SpamAssassin::BayesStore::MySQL=HASH(0x387a1c8)
M
...

Check bayes data

Run sa-learn with --dump argument will show the bayes data like below:

# sa-learn --dump magic

0.000          0          3          0  non-token data: bayes db version
0.000          0    3778575          0  non-token data: nspam
0.000          0    6326326          0  non-token data: nham
0.000          0     539978          0  non-token data: ntokens
0.000          0 1558372204          0  non-token data: oldest atime
0.000          0 1558415857          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0 1558415403          0  non-token data: last expiry atime
0.000          0      43200          0  non-token data: last expire atime delta
0.000          0      59325          0  non-token data: last expire reduction count

References