Product manual

GFI MailEssentials 14 Appendix 1 - Bayesian Filtering | 265

14 Appendix 1 - Bayesian Filtering

The Bayesian filter is an anti-spam technology used within GFI MailEssentials. It is an adaptive
technique based on artificial intelligence algorithms, hardened to withstand the widest range of
spamming techniques available today.

This chapter explains how the Bayesian filter works, how it can be configured and how it can be
trained.

NOTE

1. The Bayesian anti-spam filter is disabled by default. It is highly recommended that
you train the Bayesian filter before enabling it.

2. GFI MailEssentials must operate for at least one week for the Bayesian filter to
achieve its optimal performance. This is required because the Bayesian filter reaches
its highest detection rate only once it has adapted to your email patterns.

How does the Bayesian spam filter work?

Bayesian filtering is based on the principle that most events are dependent and that the probability of
an event occurring in the future can be inferred from the previous occurrences of that event.

NOTE

Refer to the link below for more information on the mathematical basis of Bayesian
filtering:

http://go.gfi.com/?pageid=ME_BayesianParameterEstimation

This same technique has been adapted by GFI MailEssentials to identify and classify spam. If a snippet
of text frequently occurs in spam emails but not in legitimate emails, it is reasonable to
assume that an email containing that snippet is probably spam.

Creating a tailor-made Bayesian word database

Before Bayesian filtering is used, a database of words and tokens (for example the $ sign, IP addresses
and domains) must be created. These can be collected from a sample of spam email and valid
email (referred to as ‘ham’).

A probability value is then assigned to each word or token, based on calculations that account
for how often that word occurs in spam as opposed to ham. This is done by analyzing the users'
outbound email and known spam: all the words and tokens in both pools of email are analyzed to
generate the probability that a particular word points to the email being spam.

This probability is calculated as in the following example:

If the word ‘mortgage’ occurs in 400 out of 3,000 spam emails and in 5 out of 300 legitimate emails
then its spam probability would be 0.8889 (i.e. [400/3000] / [5/300 + 400/3000]).
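The calculation above can be sketched in a few lines of Python. This is an illustrative sketch only, not the GFI MailEssentials implementation: the function names spam_probability and build_token_table are hypothetical, tokens are approximated by splitting text on whitespace, a token seen in neither pool is assumed to be neutral (0.5), and both email pools are assumed to be non-empty.

```python
from collections import Counter


def spam_probability(spam_hits, spam_total, ham_hits, ham_total):
    """Spam probability of one token, using the formula from the example above."""
    spam_rate = spam_hits / spam_total  # fraction of spam emails containing the token
    ham_rate = ham_hits / ham_total     # fraction of ham emails containing the token
    if spam_rate + ham_rate == 0:
        return 0.5  # assumption: a token seen in neither pool is treated as neutral
    return spam_rate / (ham_rate + spam_rate)


def build_token_table(spam_emails, ham_emails):
    """Map each token to its spam probability, given two lists of email texts."""
    # Count the number of emails (not total occurrences) that contain each token.
    # Whitespace splitting is a simplification of real tokenization.
    spam_counts = Counter(
        tok for text in spam_emails for tok in set(text.lower().split())
    )
    ham_counts = Counter(
        tok for text in ham_emails for tok in set(text.lower().split())
    )
    return {
        tok: spam_probability(
            spam_counts[tok], len(spam_emails), ham_counts[tok], len(ham_emails)
        )
        for tok in spam_counts.keys() | ham_counts.keys()
    }


# The 'mortgage' example: 400 of 3,000 spam emails, 5 of 300 legitimate emails.
print(round(spam_probability(400, 3000, 5, 300), 4))  # 0.8889
```

Because each count is divided by the size of its own pool, the result is not skewed when the spam sample is much larger than the ham sample, as in the 3,000 versus 300 emails of the example.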

Creating a custom ham email database

The analysis of ham email is performed on the company's email and therefore is tailored to that
particular company.