Product manual
14 Appendix 1 - Bayesian Filtering
The Bayesian filter is an anti-spam technology used within GFI MailEssentials. It is an adaptive
technique based on artificial intelligence algorithms, hardened to withstand the widest range of
spamming techniques available today.
This chapter explains how the Bayesian filter works, how it can be configured and how it can be
trained.
NOTE
1. The Bayesian anti-spam filter is disabled by default. It is highly recommended that
you train the Bayesian filter before enabling it.
2. GFI MailEssentials must operate for at least one week for the Bayesian filter to
achieve its optimal performance. The filter reaches its highest detection rate only
after it has adapted to your email patterns.
How does the Bayesian spam filter work?
Bayesian filtering is based on the principle that most events are dependent and that the probability of
an event occurring in the future can be inferred from the previous occurrences of that event.
NOTE
Refer to the link below for more information on the mathematical basis of Bayesian
filtering:
http://go.gfi.com/?pageid=ME_BayesianParameterEstimation
This same technique has been adapted by GFI MailEssentials to identify and classify spam. If a snippet
of text frequently occurs in spam emails but not in legitimate emails, it is reasonable to
assume that an email containing that snippet is probably spam.
Creating a tailor-made Bayesian word database
Before Bayesian filtering is used, a database of words and tokens (for example, the $ sign, IP addresses
and domains) must be created. These are collected from a sample of spam email and valid
email (referred to as ‘ham’).
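The collection step can be illustrated with a minimal Python sketch. It is not the GFI MailEssentials implementation: the token pattern, the function names (tokenize, count_emails_containing) and the sample messages are all hypothetical, chosen only to show words, the $ sign, IP addresses and domains being counted per email.

```python
import re
from collections import Counter

# Hypothetical token pattern (an assumption, not the product's tokenizer):
# IPv4 addresses, domain-like strings, plain words, and the $ sign.
TOKEN_RE = re.compile(
    r"\d{1,3}(?:\.\d{1,3}){3}"            # IP address, e.g. 10.0.0.1
    r"|[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"  # domain, e.g. example.com
    r"|[A-Za-z']+"                         # ordinary word
    r"|\$"                                 # dollar sign
)

def tokenize(text):
    """Return the set of distinct lowercase tokens found in one email body."""
    return {token.lower() for token in TOKEN_RE.findall(text)}

def count_emails_containing(emails):
    """Map each token to the number of emails in which it appears at least once."""
    counts = Counter()
    for body in emails:
        counts.update(tokenize(body))
    return counts

# Made-up sample pools, for illustration only.
spam_counts = count_emails_containing([
    "Low mortgage rates, save $ today at 10.0.0.1",
    "Refinance your mortgage now",
])
ham_counts = count_emails_containing(["Meeting notes attached"])
```

Counting each token once per email, rather than once per occurrence, matches the "occurs in N out of M emails" form used in this appendix.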
A probability value is then assigned to each word or token, based on how often that word
occurs in spam as opposed to ham. This is done by analyzing the users'
outbound email and known spam: all the words and tokens in both pools of email are analyzed to
generate the probability that a particular word points to the email being spam.
This probability is calculated as in the following example:
If the word ‘mortgage’ occurs in 400 out of 3,000 spam emails and in 5 out of 300 legitimate emails
then its spam probability would be 0.8889 (i.e. [400/3000] / [5/300 + 400/3000]).
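The same calculation can be written as a short Python sketch. The function name spam_probability and the neutral value of 0.5 for a word never seen in either pool are assumptions made for illustration; only the ratio itself comes from the example above.

```python
def spam_probability(spam_hits, spam_total, ham_hits, ham_total):
    """Estimate how strongly a word points to spam, using the ratio shown above."""
    spam_rate = spam_hits / spam_total  # share of spam emails containing the word
    ham_rate = ham_hits / ham_total     # share of ham emails containing the word
    if spam_rate + ham_rate == 0:
        # Assumption: a word seen in neither pool is treated as neutral.
        return 0.5
    return spam_rate / (spam_rate + ham_rate)

# 'mortgage': 400 of 3,000 spam emails, 5 of 300 legitimate emails.
p = spam_probability(400, 3000, 5, 300)
print(round(p, 4))  # 0.8889
```

Because each count is divided by the size of its own pool, the result does not depend on the spam sample being ten times larger than the ham sample.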
Creating a custom ham email database
The analysis of ham email is performed on the company's email and therefore is tailored to that
particular company.