Bayesian Filtering Example
Using Bayes' Formula to keep spam out of your Inbox
Bayes’ Formula
Thomas Bayes was born in 1702 in London, the son of
a minister. After being educated
privately, he was ordained a minister like his father and was assigned to a
chapel in Tunbridge Wells, 35 miles outside of London. After Bayes’ death in 1761, his friend
Richard Price discovered his theory of probability in his papers. The theory was published by the Royal Society
in 1764.
In basic terms, Bayes’ Formula allows us to
determine the probability of an event occurring based on the probabilities of
two or more independent evidentiary events.
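Mathematically, for n evidentiary events with probabilities p_1, p_2, ..., p_n, the general formula is represented as:
$$\frac{p_1 p_2 \cdots p_n}{p_1 p_2 \cdots p_n + (1 - p_1)(1 - p_2) \cdots (1 - p_n)}$$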
Assuming that the variables a and b are the probabilities of two evidentiary events, the probability would be equal to:
$$\frac{ab}{ab + (1 - a)(1 - b)}$$
For three evidentiary events a, b, and c, the formula expands so the probability is equal to:
$$\frac{abc}{abc + (1 - a)(1 - b)(1 - c)}$$
In this fashion, the formula can be expanded to accommodate any number of evidentiary events.
This document introduces Bayes’ Formula and provides an in-depth example of how a Bayesian filter can be used to classify spam email messages. A more general overview of Bayesian filtering is contained in the Introduction to Bayesian Filtering whitepaper, available from Process Software’s website at http://www.process.com.
A Simple Example
Suppose that CheapSkies Airlines flights between Boston and New York City are delayed 75% of the time if it’s raining. Also suppose that if a flight is scheduled to leave Boston before noon, it’s only delayed 10% of the time (rain or shine). If you take a CheapSkies flight from Boston to New York City on a rainy day, and the flight is scheduled to depart before noon, what is the probability that your flight will be delayed?
Since there are only two pieces of evidence to
consider (the weather conditions and the scheduled departure time), we can use
the basic form of Bayes’ Formula to solve this problem. The probability that the flight will be delayed
on a rainy day (75%, or 0.75) is represented by the variable a, and the
probability that the flight will be delayed if it’s scheduled to leave before
noon (10%, or 0.10) is represented by the variable b.
Filling in Bayes’ Formula from above, we see that
the probability is equal to:
$$\frac{(0.75)(0.10)}{(0.75)(0.10) + (1 - 0.75)(1 - 0.10)}$$
Solving this equation yields a probability of 0.25,
or a 25% chance that your flight will be delayed.
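The arithmetic above can be checked with a short Python sketch. The function name `combine` is hypothetical and chosen only for illustration. It assumes the evidence probabilities are independent, as the formula requires, and it works for any number of evidentiary events.

```python
from math import prod


def combine(probabilities):
    """Combine independent evidence probabilities with Bayes' Formula."""
    # Product of the probabilities (the numerator).
    joint = prod(probabilities)
    # Product of the complements (the second term of the denominator).
    joint_complement = prod(1.0 - p for p in probabilities)
    return joint / (joint + joint_complement)


# Rainy day (0.75) and a departure scheduled before noon (0.10).
delay = combine([0.75, 0.10])
print(round(delay, 2))  # 0.25
```

The sketch does not guard against a zero denominator, which would occur only if the evidence contained both a probability of exactly 0 and a probability of exactly 1.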
An important observation from this example is that
we’re dealing with independent events – the probability of one event has
no impact on the other event. In the
case of our example, there’s a 75% chance the flight will be delayed on a rainy
day regardless of whether or not it’s scheduled to leave before noon. The probability of 75% includes both cases
where the flight leaves before noon, and cases where it doesn’t. Likewise, the fact that there’s a 10% chance
of the flight being delayed if it leaves before noon takes into account all
flights – not just ones that leave on rainy days.
Using this concept to filter spam messages is known
as naive Bayesian filtering, because we don’t take into account the
relationships between the various words contained in email messages. While it may certainly be true that a
message containing all three of the words “clinical”, “trial”, and “Viagra” is
never spam, all the naive Bayesian filter knows is that the words “clinical”
and “trial” occur mostly in non-spam messages while the word “Viagra” occurs
mostly in spam messages.
Spam Filtering Example
In the real world, applications for Bayes’ Formula
are messier and more complicated than the contrived example in the previous
section. Following is a complete
example of an email message being filtered by a Bayesian filter similar to the
one included in Process Software’s PreciseMail Anti-Spam Gateway.
For our example, we’re going to use the following
“Nigerian spam” message. Note that
we’re looking at the complete message – headers and all.
Figure 1: Sample Spam Message
Received: from unknown (HELO incamail.com) (209.11.24.18)
    by venice.example.com with SMTP; 4 May 2003 14:15:35 -0000
Received: from [10.1.1.27] (HELO app2.incamail.com)
    by incamail.com (CommuniGate Pro SMTP 4.0.6)
    with ESMTP id 2217203; Sun, 04 May 2003 10:12:16 -0400
Message-ID: <6549662.1052057538895.JavaMail.tomcat@app2.incamail.com>
From: BUMA SARO WIWA <bsarowiwa@incamail.com>
To: bsarowiwa@incamail.com
Subject: URGENT ASSISTANCE PLEAse
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-Suffix: INBOX
Date: Sun, 04 May 2003 10:12:16 -0400
Content-Length: 2388
Princess Buma Saro-Wiwa
101 Younde avenue YD
2390 Cameroun.
bsarowiwa@incamail.com OR
b_sarowiwa@yahoo.com.au
Dear Friend,
I got your contact from a
directory in a library in one of our international school in my country and my
instinct tells me to write you and i feel It will be a great pleasure to be in
contact with someone like you.
frist, let me introduce
myself, my name is PrincessBuma Nene Saro Wiwa Ken. I am 27 years old from a
royal family of Ken sarowiwa Kings hence I bear the tittle "PRINCESS"
I am single and the only duagther of my parents.my father was a royal king of
OGONI a prominent community in Rivers state Nigeria who was killed through
hanging by the order of late Gen sani Abacha because of his community
inheritance which are ( crude oil) that the F.G.N has taken possession of it.
We are only two, I and my
younger brother KEN SARO WIWA[jnr],after one year death of my father, my mother
died of High Blood preasure (HBP).Meanwhile, we inherited some fortune in form
of cash which I will reveal to you when we get your response.Our old family
friends have been very dishonest with us since the death of our parents, they
have duped us of virtually all cash in the banks with different stories and
reason. As such we decided to cut off relationship from people around us
because we find out that they have on motive to squander what is left. We had
to leave Nigeria to stay in neighbuoring cameroun republic with the assistance
of our family lawyer in Nigeria, we are here now for three years and would like
to move out to another continent.I am interested to enter into strong relation
with you as a friend and partner after i have gotten good information about you
on internet.To be frank, we need someone who is kind and sincere that will
assist us.
We are interested to
invest and live in your country therefore, it will be our pleasure if you can
be of help to us by assisting us to handle the investment and planing of our
fortune we inherited, to enable us build a new home for safekeeping of our
lives.
Please let me receive your
response urgently.My kindest compliments.
Yours Faithfully,
Princess B. Saro-Wiwa.
bsarowiwa@incamail.com OR
b_sarowiwa@yahoo.com.au
------------------------------------------------------------
Tired of spam and email
overload?
Get a FREE 6MB email
account at http://www.incamail.com
The first thing a Bayesian filter must do is split
the message into tokens and build a table of all the tokens it intends to use
in the decision making process. For our
sample message, the table would be:
Figure 2: Spam Message Token Table
10.1.1.27 209.11.24.18 abacha about
account after all and
another app2.incamail.com are around
assist assistance assisting avenue
banks bear because been
bit blood brother bsarowiwa
build buma cameroun can
cash charset communigate community
compliments contact content-length content-type
continent.i country crude cut
dear death decided died
different directory dishonest duagther
duped email enable enter
esmtp f.g.n faithfully family
father feel find for
form fortune frank free
friend friends frist from
gen get good got
gotten great had handle
hanging has have hbp
helo help hence here
high his home http
inbox incamail.com information inheritance
inherited instinct interested international
internet.to into introduce invest
investment jnr ken killed
kind kindest king kings
late lawyer leave left
let library like live
lives may meanwhile mime-version
mother motive move myself
name need neighbuoring nene
new nigeria now off
ogoni oil old one
only order our out
overload parents parents.my partner
people plain planing please
pleasure possession preasure princess
princessbuma pro prominent reason
receive received relation relationship
republic response response.our reveal
rivers royal safekeeping sani
saro saro-wiwa sarowiwa school
since sincere single smtp
some someone spam squander
state stay stories strong
subject such sun taken
tells text that the
therefore they three through
tired tittle two unknown
urgent urgently.my us-ascii venice.example.com
very virtually was what
when which who will
with wiwa would write
www.incamail.com x-priority x-suffix yahoo.com.au
year years you younde
younger your yours
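The tokenizing rules themselves are not spelled out in this document, so the following Python sketch is only an approximation inferred from Figure 2. It assumes that tokens are lowercased, that dots and hyphens inside a token are kept (as in “urgently.my” and “saro-wiwa”), that tokens shorter than three characters are ignored, and that duplicates are dropped. The name `tokenize` and the constant `MIN_TOKEN_LENGTH` are hypothetical, and the sketch does not reproduce every entry in Figure 2 (for example, it treats purely numeric strings differently).

```python
import re

MIN_TOKEN_LENGTH = 3  # assumption: shorter tokens are ignored


def tokenize(message):
    """Return the sorted list of distinct lowercase tokens in a message."""
    # Split on anything that is not a letter, digit, dot, or hyphen.
    pieces = re.split(r"[^a-z0-9.\-]+", message.lower())
    tokens = set()
    for piece in pieces:
        # Drop leading and trailing punctuation such as a sentence-ending dot.
        token = piece.strip(".-")
        if len(token) >= MIN_TOKEN_LENGTH:
            tokens.add(token)
    return sorted(tokens)


print(tokenize("Please let me receive your response urgently.My kindest compliments."))
```

For this sentence from the sample message, the sketch yields tokens that all appear in Figure 2, including the run-together token “urgently.my”.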
Once the Bayesian filter has the list of tokens in
the message, it searches the spam and non-spam token databases for these
tokens. These databases of tokens are
created and updated whenever the Bayesian filter is “trained” on a new
message.
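A minimal sketch of how such token databases might be maintained during training is shown below. The class `TokenDatabase` and its methods are hypothetical stand-ins, not the PreciseMail implementation. The sketch assumes that every occurrence of a token is counted, which is consistent with the frequencies in Figure 3, where common tokens such as “and” occur more often than the number of messages trained on.

```python
from collections import Counter


class TokenDatabase:
    """Hypothetical in-memory stand-in for the spam and ham token databases."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.spam_messages = 0
        self.ham_messages = 0

    def train(self, tokens, is_spam):
        """Record one training message, given as a list of its tokens."""
        if is_spam:
            self.spam_counts.update(tokens)
            self.spam_messages += 1
        else:
            self.ham_counts.update(tokens)
            self.ham_messages += 1

    def lookup(self, token):
        """Return (spam frequency, ham frequency); unseen tokens give 0."""
        return self.spam_counts[token], self.ham_counts[token]


db = TokenDatabase()
db.train(["urgent", "cash", "cash"], is_spam=True)
db.train(["meeting", "cash"], is_spam=False)
print(db.lookup("cash"))    # (2, 1)
print(db.lookup("unseen"))  # (0, 0)
```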
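If a token from the message is found in the databases, the Bayesian filter calculates the token’s spamicity based on the following variables: the token’s frequency in spam messages, the token’s frequency in ham (non-spam) messages, the number of spam messages the filter has been trained on, and the number of ham messages the filter has been trained on.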
The algorithm used to calculate a token’s spamicity from these pieces of information is as follows:
Ham probability = Token frequency in ham messages / Number of ham messages trained on
Spam probability = Token frequency in spam messages / Number of spam messages trained on
If either Ham probability or Spam probability is greater than 1.0, set it equal to 1.0.
Spamicity = Spam probability / (Ham probability + Spam probability)
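If a token has occurred fewer than 5 times in total across both ham and spam messages, the token is assigned a default spamicity of 0.4. The values in Figure 3 also show that calculated spamicities are kept between 0.01 and 0.99: a token seen only in spam messages (such as “crude”) receives 0.99, and a token seen only in ham messages (such as “inherited”) receives 0.01.
The following example and table use a set of sample token databases generated from a live mail feed on a test system at Process Software. The Bayesian filter was trained on 19,977 spam messages and 5,141 ham messages.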
An example of this algorithm, using the token
“after” from the example spam message and frequency values from Figure 3, is:
Ham probability = 1184 / 5141 = 0.230305
Spam probability = 1134 / 19977 = 0.056765
Spamicity = 0.056765 / (0.056765 + 0.230305) = 0.197740
This tells us that there’s only a 19.8% chance that
a message containing the word “after” is a spam message.
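The spamicity calculation can be sketched in Python as follows. The function name `spamicity` and its parameter names are hypothetical. The message totals are the training counts given in the text. The final clamp to the range 0.01 to 0.99 is an assumption inferred from the values in Figure 3 rather than a rule stated in the algorithm.

```python
SPAM_MESSAGES = 19977  # spam messages trained on (from the text)
HAM_MESSAGES = 5141    # ham messages trained on (from the text)


def spamicity(spam_freq, ham_freq,
              spam_total=SPAM_MESSAGES, ham_total=HAM_MESSAGES):
    """Return a token's spamicity from its spam and ham frequencies."""
    # Rarely seen tokens get the default value.
    if spam_freq + ham_freq < 5:
        return 0.4
    # Probabilities greater than 1.0 are capped at 1.0.
    ham_prob = min(1.0, ham_freq / ham_total)
    spam_prob = min(1.0, spam_freq / spam_total)
    value = spam_prob / (ham_prob + spam_prob)
    # Assumption inferred from Figure 3: keep results within [0.01, 0.99].
    return max(0.01, min(0.99, value))


print(f"{spamicity(1134, 1184):.6f}")  # token "after": 0.197740
```

Calling the function with the frequencies for “and” (32109 and 12353) gives 0.5, because both probabilities are capped at 1.0.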
Repeating this process for each of the tokens in our
sample message, we get the following frequencies and spamicities:
Figure 3: Spam Message Token Frequency and Spamicity Table
| Token | Spam Frequency | Ham Frequency | Spamicity |
|---|---|---|---|
| 10.1.1.27 | 0 | 0 | 0.400000 |
| 209.11.24.18 | 0 | 0 | 0.400000 |
| abacha | 14 | 2 | 0.643038 |
| about | 3301 | 2578 | 0.247848 |
| account | 585 | 563 | 0.210984 |
| after | 1134 | 1184 | 0.197740 |
| all | 9767 | 3759 | 0.400717 |
| and | 32109 | 12353 | 0.500000 |
| another | 1305 | 784 | 0.299898 |
| app2.incamail.com | 0 | 0 | 0.400000 |
| are | 13555 | 6130 | 0.404241 |
| around | 433 | 480 | 0.188409 |
| assist | 256 | 46 | 0.588847 |
| assistance | 386 | 171 | 0.367453 |
| assisting | 6 | 4 | 0.278509 |
| avenue | 70 | 25 | 0.418797 |
| banks | 238 | 8 | 0.884474 |
| bear | 80 | 12 | 0.631763 |
| because | 5114 | 973 | 0.574936 |
| been | 3233 | 2036 | 0.290097 |
| bit | 4296 | 2292 | 0.325398 |
| blood | 383 | 53 | 0.650312 |
| brother | 171 | 171 | 0.403703 |
| bsarowiwa | 0 | 0 | 0.400000 |
| build | 3364 | 576 | 0.600475 |
| buma | 0 | 0 | 0.400000 |
| cameroun | 0 | 0 | 0.400000 |
| can | 8083 | 4568 | 0.312889 |
| cash | 1318 | 49 | 0.873771 |
| charset | 9300 | 3324 | 0.418608 |
| communigate | 16 | 61 | 0.063232 |
| community | 70 | 76 | 0.191612 |
| compliments | 58 | 58 | 0.788651 |
| contact | 1552 | 760 | 0.344489 |
| content-length | 0 | 0 | 0.400000 |
| content-type | 26907 | 5054 | 0.504267 |
| continent.i | 0 | 0 | 0.400000 |
| country | 316 | 62 | 0.567406 |
| crude | 19 | 0 | 0.990000 |
| cut | 272 | 199 | 0.260218 |
| dear | 752 | 113 | 0.631350 |
| death | 118 | 37 | 0.450768 |
| decided | 205 | 107 | 0.330228 |
| died | 44 | 31 | 0.267542 |
| different | 593 | 704 | 0.178152 |
| directory | 57 | 401 | 0.035289 |
| dishonest | 0 | 0 | 0.400000 |
| duagther | 0 | 0 | 0.400000 |
| duped | 0 | 0 | 0.400000 |
| email | 13820 | 2097 | 0.629081 |
| enable | 65 | 97 | 0.147084 |
| enter | 753 | 139 | 0.582309 |
| esmtp | 7239 | 7152 | 0.265983 |
| f.g.n | 0 | 0 | 0.400000 |
| faithfully | 35 | 0 | 0.990000 |
| family | 3255 | 172 | 0.829646 |
| father | 75 | 38 | 0.336835 |
| feel | 2269 | 299 | 0.661350 |
| find | 2966 | 854 | 0.471956 |
| for | 29946 | 14355 | 0.500000 |
| form | 2721 | 258 | 0.730756 |
| fortune | 211 | 16 | 0.772404 |
| frank | 47 | 85 | 0.124571 |
| free | 13077 | 948 | 0.780215 |
| friend | 456 | 110 | 0.516164 |
| friends | 1215 | 181 | 0.633362 |
| frist | 0 | 0 | 0.400000 |
| from | 65251 | 18549 | 0.500000 |
| gen | 63 | 14 | 0.536620 |
| get | 10853 | 2876 | 0.492677 |
| good | 1426 | 1752 | 0.173185 |
| got | 946 | 998 | 0.196101 |
| gotten | 49 | 35 | 0.264860 |
| great | 1761 | 556 | 0.449061 |
| had | 1202 | 1709 | 0.153260 |
| handle | 201 | 103 | 0.334309 |
| hanging | 39 | 51 | 0.164434 |
| has | 3661 | 2693 | 0.259176 |
| have | 11235 | 7113 | 0.359958 |
| hbp | 0 | 0 | 0.400000 |
| helo | 1855 | 1473 | 0.244761 |
| help | 2364 | 1406 | 0.302014 |
| hence | 36 | 16 | 0.366699 |
| high | 2032 | 265 | 0.663674 |
| his | 815 | 712 | 0.227545 |
| home | 3510 | 650 | 0.581532 |
| http | 57485 | 4233 | 0.548432 |
| inbox | 74 | 91 | 0.173055 |
| incamail.com | 0 | 0 | 0.400000 |
| information | 4197 | 1490 | 0.420252 |
| inheritance | 0 | 0 | 0.400000 |
| inherited | 0 | 5 | 0.010000 |
| instinct | 0 | 0 | 0.400000 |
| interested | 592 | 237 | 0.391291 |
| international | 1392 | 165 | 0.684648 |
| internet.to | 0 | 0 | 0.400000 |
| into | 1359 | 1268 | 0.216187 |
| introduce | 53 | 20 | 0.405458 |
| invest | 139 | 7 | 0.836338 |
| investment | 657 | 31 | 0.845059 |
| jnr | 0 | 0 | 0.400000 |
| ken | 0 | 0 | 0.400000 |
| killed | 10 | 25 | 0.093331 |
| kind | 130 | 266 | 0.111720 |
| kindest | 0 | 0 | 0.400000 |
| king | 210 | 117 | 0.315960 |
| kings | 8 | 24 | 0.079005 |
| late | 181 | 221 | 0.174078 |
| lawyer | 31 | 9 | 0.469894 |
| leave | 141 | 189 | 0.161066 |
| left | 9847 | 488 | 0.838522 |
| let | 1007 | 987 | 0.207959 |
| library | 242 | 274 | 0.185197 |
| like | 6794 | 2752 | 0.388500 |
| live | 667 | 166 | 0.508366 |
| lives | 106 | 47 | 0.367248 |
| may | 4255 | 2102 | 0.342510 |
| meanwhile | 3 | 13 | 0.056058 |
| mime-version | 17646 | 4370 | 0.509602 |
| mother | 76 | 45 | 0.302956 |
| motive | 0 | 0 | 0.400000 |
| move | 403 | 336 | 0.235861 |
| myself | 103 | 110 | 0.194178 |
| name | 10101 | 1624 | 0.615480 |
| need | 2714 | 1813 | 0.278103 |
| neighbuoring | 0 | 0 | 0.400000 |
| nene | 0 | 0 | 0.400000 |
| new | 9051 | 2191 | 0.515291 |
| nigeria | 132 | 2 | 0.944398 |
| now | 8920 | 2034 | 0.530203 |
| off | 3061 | 835 | 0.485437 |
| ogoni | 0 | 0 | 0.400000 |
| oil | 64 | 42 | 0.281685 |
| old | 949 | 731 | 0.250427 |
| one | 8722 | 2995 | 0.428388 |
| only | 4954 | 2298 | 0.356824 |
| order | 4442 | 680 | 0.627015 |
| our | 16869 | 1634 | 0.726535 |
| out | 5565 | 2829 | 0.336092 |
| overload | 0 | 5 | 0.010000 |
| parents | 119 | 61 | 0.334237 |
| parents.my | 0 | 0 | 0.400000 |
| partner | 509 | 39 | 0.770574 |
| people | 1808 | 828 | 0.359768 |
| plain | 954 | 3206 | 0.071131 |
| planing | 0 | 0 | 0.400000 |
| please | 11780 | 2108 | 0.589846 |
| pleasure | 117 | 13 | 0.698442 |
| possession | 10 | 9 | 0.222359 |
| preasure | 0 | 0 | 0.400000 |
| princess | 0 | 0 | 0.400000 |
| princessbuma | 0 | 0 | 0.400000 |
| pro | 1388 | 102 | 0.777873 |
| prominent | 6 | 0 | 0.990000 |
| reason | 552 | 487 | 0.225823 |
| receive | 8509 | 348 | 0.862871 |
| received | 19967 | 10164 | 0.499875 |
| relation | 20 | 3 | 0.631763 |
| relationship | 133 | 69 | 0.331570 |
| republic | 34 | 16 | 0.353529 |
| response | 645 | 311 | 0.347992 |
| response.our | 0 | 0 | 0.400000 |
| reveal | 29 | 3 | 0.713276 |
| rivers | 0 | 0 | 0.400000 |
| royal | 168 | 16 | 0.729885 |
| safekeeping | 10 | 0 | 0.990000 |
| sani | 0 | 0 | 0.400000 |
| saro | 0 | 0 | 0.400000 |
| saro-wiwa | 0 | 0 | 0.400000 |
| sarowiwa | 0 | 0 | 0.400000 |
| school | 313 | 68 | 0.542239 |
| since | 299 | 854 | 0.082654 |
| sincere | 22 | 0 | 0.990000 |
| single | 229 | 372 | 0.136755 |
| smtp | 2374 | 1702 | 0.264140 |
| some | 1981 | 2262 | 0.183924 |
| someone | 728 | 517 | 0.265988 |
| spam | 1167 | 956 | 0.239049 |
| squander | 0 | 0 | 0.400000 |
| state | 929 | 467 | 0.338597 |
| stay | 453 | 201 | 0.367084 |
| stories | 112 | 44 | 0.395793 |
| strong | 10357 | 154 | 0.945377 |
| subject | 22169 | 10497 | 0.500000 |
| such | 1026 | 848 | 0.237435 |
| sun | 2608 | 1611 | 0.294089 |
| taken | 382 | 122 | 0.446225 |
| tells | 11 | 29 | 0.088933 |
| text | 19009 | 4012 | 0.549410 |
| that | 10559 | 9075 | 0.345789 |
| the | 34475 | 16621 | 0.500000 |
| therefore | 117 | 122 | 0.197946 |
| they | 2319 | 2640 | 0.184376 |
| three | 607 | 245 | 0.389346 |
| through | 4241 | 758 | 0.590138 |
| tired | 227 | 128 | 0.313369 |
| tittle | 0 | 0 | 0.400000 |
| two | 775 | 940 | 0.175036 |
| unknown | 2667 | 695 | 0.496866 |
| urgent | 93 | 31 | 0.435678 |
| urgently.my | 0 | 0 | 0.400000 |
| us-ascii | 665 | 1891 | 0.082989 |
| venice.example.com | 0 | 0 | 0.400000 |
| very | 1173 | 980 | 0.235490 |
| virtually | 136 | 18 | 0.660371 |
| was | 3573 | 4367 | 0.173933 |
| what | 3050 | 3548 | 0.181150 |
| when | 2404 | 2614 | 0.191378 |
| which | 1200 | 2132 | 0.126521 |
| who | 2041 | 1183 | 0.307476 |
| will | 9749 | 4255 | 0.370922 |
| with | 39458 | 15761 | 0.500000 |
| wiwa | 0 | 0 | 0.400000 |
| would | 6023 | 3296 | 0.319851 |
| write | 903 | 329 | 0.413948 |
| www.incamail.com | 0 | 0 | 0.400000 |
| x-priority | 11524 | 852 | 0.776826 |
| x-suffix | 0 | 0 | 0.400000 |
| yahoo.com.au | 0 | 0 | 0.400000 |
| year | 1096 | 421 | 0.401182 |
| years | 1397 | 503 | 0.416820 |
| you | 40273 | 9606 | 0.500000 |
| younde | 0 | 0 | 0.400000 |
| younger | 250 | 4 | 0.941466 |
| your | 31926 | 4534 | 0.531370 |
| yours | 682 | 75 | 0.700611 |
Now that the filter has calculated the spamicity
value for each token in the message, it needs to choose 15 tokens that will be
plugged into the Bayesian formula to calculate the message’s overall
spamicity. Using a subset of the tokens
in the message enhances the Bayesian filter’s performance, especially when
dealing with large messages.
Early implementations of Bayesian filters chose the
15 tokens that had the most extreme values (i.e. the 15 tokens whose value was
furthest from the neutral value of 0.5).
Spammers have started including words that they’re fairly sure will have
a low spamicity, such as “congresswoman” and “umbrella”, in their messages in
an attempt to circumvent this system.
As a result, the Bayesian filter included in Process Software’s PreciseMail Anti-Spam Gateway uses a sampling algorithm based on standard deviations to choose the 15 tokens fed to the Bayesian formula.
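For comparison, the selection rule used by the early implementations described above (the tokens furthest from the neutral value of 0.5) can be sketched in a few lines of Python. This is not the sampling algorithm used by PreciseMail Anti-Spam Gateway, whose details are not given here. The function name `most_extreme_tokens` is hypothetical, and the sample values are taken from Figure 3.

```python
def most_extreme_tokens(spamicities, count=15):
    """Pick the tokens whose spamicity lies furthest from the neutral 0.5."""
    ranked = sorted(spamicities.items(),
                    key=lambda item: abs(item[1] - 0.5),
                    reverse=True)
    return dict(ranked[:count])


# A few tokens and spamicity values from Figure 3.
sample = {"crude": 0.990000, "and": 0.500000, "inherited": 0.010000,
          "free": 0.780215, "year": 0.401182}
print(most_extreme_tokens(sample, count=3))
```

With a count of 3, the sketch keeps “crude”, “inherited”, and “free” and drops the near-neutral tokens “and” and “year”. A rule this simple is what makes the padding trick with low-spamicity words effective.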
For our sample message, the 15 tokens chosen by the
Bayesian filter are:
Figure 4: Token Subset Used in Bayesian Formula
| Token | Spamicity |
|---|---|
| account | 0.210984 |
| after | 0.197740 |
| crude | 0.990000 |
| faithfully | 0.990000 |
| good | 0.173185 |
| inherited | 0.010000 |
| invest | 0.836338 |
| investment | 0.845059 |
| let | 0.207959 |
| overload | 0.010000 |
| prominent | 0.990000 |
| receive | 0.862871 |
| safekeeping | 0.990000 |
| sincere | 0.990000 |
| therefore | 0.197946 |
Once the Bayesian filter has selected 15 tokens, it
plugs their spamicity values into Bayes’ formula, as shown below. (With 15 different values, this gets a
little bit messy on paper.) For our
sample message, the probability of the message being spam is:
$$
\frac{N}{N + C}
$$

where N is the product of the 15 spamicity values and C is the product of their complements:

$$
\begin{aligned}
N ={} & (0.210984)(0.197740)(0.990000)(0.990000)(0.173185)(0.010000)(0.836338)(0.845059) \\
      & (0.207959)(0.010000)(0.990000)(0.862871)(0.990000)(0.990000)(0.197946) \\
C ={} & (1 - 0.210984)(1 - 0.197740)(1 - 0.990000)(1 - 0.990000)(1 - 0.173185) \\
      & (1 - 0.010000)(1 - 0.836338)(1 - 0.845059)(1 - 0.207959)(1 - 0.010000) \\
      & (1 - 0.990000)(1 - 0.862871)(1 - 0.990000)(1 - 0.990000)(1 - 0.197946)
\end{aligned}
$$
This equation simplifies to:
$$\frac{0.000000017249220883574410361053715216318}{0.000000017249334195201446371086}$$
Solving this equation yields a probability of 0.999993, or a 99.9993% chance that the message is spam. If this message were sent to an email server protected by PreciseMail Anti-Spam Gateway, it would be quarantined, discarded, or tagged as spam based on the options chosen by the system administrator.
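The final calculation can be reproduced with the short Python sketch below, which uses the 15 spamicity values from Figure 4. The variable names are hypothetical and chosen for illustration. Floating-point arithmetic is assumed to be precise enough here, since the result is reported to only six decimal places.

```python
from math import prod

# Spamicity values of the 15 selected tokens, copied from Figure 4.
SELECTED = [0.210984, 0.197740, 0.990000, 0.990000, 0.173185,
            0.010000, 0.836338, 0.845059, 0.207959, 0.010000,
            0.990000, 0.862871, 0.990000, 0.990000, 0.197946]

# Product of the spamicities (the numerator).
spam_product = prod(SELECTED)
# Product of the complements (the second term of the denominator).
ham_product = prod(1.0 - s for s in SELECTED)

probability = spam_product / (spam_product + ham_product)
print(f"{probability:.6f}")  # 0.999993
```

Although two of the 15 tokens have a spamicity of only 0.01, the five tokens at 0.99 and the three above 0.8 dominate the products, which is why the combined probability ends up so close to 1.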
Bayesian filtering is one method used by Process Software’s
PreciseMail Anti-Spam Gateway to keep junk email out of your Inbox.
For more information on Bayesian filtering, including a more general
overview, visit the Process Software website at http://www.process.com/.
A free demonstration of PreciseMail Anti-Spam Gateway is also available
from the Process Software website, so you can try Bayesian filtering on your
email server.
Process Software | 959 Concord Street, Framingham, MA 01701 | 800-722-7770; 508-879-6994 | fax 508-879-0042 www.process.com