Bayesian Filtering Example

 


Bayes’ Formula

Thomas Bayes was born in 1702 in London, the son of a minister. After being educated privately, he was ordained a minister like his father and was assigned to a chapel in Tunbridge Wells, 35 miles outside of London. After Bayes’ death in 1761, his friend Richard Price discovered his theory of probability in his papers. The theory was published by the Royal Society in 1764.

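In basic terms, Bayes’ Formula allows us to determine the probability of an event occurring based on the probabilities of two or more independent evidentiary events. Mathematically, for n evidentiary events with probabilities p_1 through p_n, the general formula is represented as:

$$P = \frac{p_1 p_2 \cdots p_n}{p_1 p_2 \cdots p_n + (1 - p_1)(1 - p_2) \cdots (1 - p_n)}$$

Assuming that the variables a and b are the probabilities of two evidentiary events, the probability would be equal to:

$$P = \frac{ab}{ab + (1 - a)(1 - b)}$$

For three evidentiary events a, b, and c, the formula expands so the probability is equal to:

$$P = \frac{abc}{abc + (1 - a)(1 - b)(1 - c)}$$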

In this fashion, the formula can be expanded to accommodate any number of evidentiary events.

This document introduces Bayes’ Formula and provides an in-depth example of how a Bayesian filter can be used to classify spam e-mail messages. A more general overview of Bayesian filtering is contained in the Introduction to Bayesian Filtering.

 

A Simple Example

Suppose that CheapSkies Airlines flights between Boston and New York City are delayed 75% of the time if it’s raining. Also suppose that if a flight is scheduled to leave Boston before noon, it’s only delayed 10% of the time (rain or shine). If you take a CheapSkies flight from Boston to New York City on a rainy day, and the flight is scheduled to depart before noon, what is the probability that your flight will be delayed?

Since there are only two pieces of evidence to consider (the weather conditions and the scheduled departure time), we can use the basic form of Bayes’ Formula to solve this problem. The probability that the flight will be delayed on a rainy day (75%, or 0.75) is represented by the variable a, and the probability that the flight will be delayed if it’s scheduled to leave before noon (10%, or 0.10) is represented by the variable b.

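Filling in Bayes’ Formula from above, we see that the probability is equal to:

$$P = \frac{0.75 \times 0.10}{0.75 \times 0.10 + (1 - 0.75)(1 - 0.10)} = \frac{0.075}{0.075 + 0.225}$$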

Solving this equation yields a probability of 0.25, or a 25% chance that your flight will be delayed.
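The flight calculation can be checked with a few lines of Python. This is a minimal sketch for illustration only; the function name `combine_probabilities` is an assumption, and the code simply evaluates the product of the probabilities divided by that product plus the product of their complements.

```python
from math import prod  # available in Python 3.8 and later


def combine_probabilities(probabilities):
    """Combine independent evidence probabilities with Bayes' Formula.

    Evaluates prod(p) / (prod(p) + prod(1 - p)) over all inputs.
    """
    joint = prod(probabilities)
    joint_complement = prod(1.0 - p for p in probabilities)
    return joint / (joint + joint_complement)


# Rainy day (0.75) combined with a departure before noon (0.10).
delay_probability = combine_probabilities([0.75, 0.10])
print(round(delay_probability, 2))
```

The same function accepts any number of probabilities, which mirrors how the formula expands for additional evidentiary events. It does not guard against the degenerate case where the inputs include both 0 and 1, which would make the denominator zero.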

An important observation from this example is that we’re dealing with independent events – the probability of one event has no impact on the other event. In the case of our example, there’s a 75% chance the flight will be delayed on a rainy day regardless of whether or not it’s scheduled to leave before noon. The probability of 75% includes both cases where the flight leaves before noon, and cases where it doesn’t. Likewise, the fact that there’s a 10% chance of the flight being delayed if it leaves before noon takes into account all flights – not just ones that leave on rainy days.

Using this concept to filter spam messages is known as naive Bayesian filtering, because we don’t take into account the relationships between the various words contained in email messages. While it may certainly be true that a message containing all three of the words “clinical”, “trial”, and “Viagra” is never spam, all the naive Bayesian filter knows is that the words “clinical” and “trial” occur mostly in non-spam messages while the word “Viagra” occurs mostly in spam messages.

 

Spam Filtering Example

In the real world, applications for Bayes’ Formula are messier and more complicated than the contrived example in the previous section. Following is a complete example of an e-mail message being filtered by a Bayesian filter similar to the one included in Process Software’s PreciseMail Anti-Spam Gateway.

For our example, we’re going to use the following “Nigerian spam” message. Note that we’re looking at the complete message – headers and all.

Received: from unknown (HELO incamail.com) (209.11.24.18)
  by venice.example.com with SMTP; 4 May 2003 14:15:35 -0000
Received: from [10.1.1.27] (HELO app2.incamail.com)
  by incamail.com (CommuniGate Pro SMTP 4.0.6)
  with ESMTP id 2217203; Sun, 04 May 2003 10:12:16 -0400
Message-ID: <6549662.1052057538895.JavaMail.tomcat@app2.incamail.com>
From: BUMA SARO WIWA 
To: bsarowiwa@incamail.com
Subject: URGENT ASSISTANCE PLEAse
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-Suffix: INBOX
Date: Sun, 04 May 2003 10:12:16 -0400
Content-Length: 2388

   Princess Buma Saro-Wiwa
101 Younde avenue YD
2390 Cameroun.
bsarowiwa@incamail.com OR b_sarowiwa@yahoo.com.au

Dear Friend,

I got your contact from a directory in a library in one of our international
school in my country and my instinct tells me to write you and i feel It will
be a great pleasure to be in contact with someone like you.

frist, let me introduce myself, my name is PrincessBuma Nene Saro Wiwa Ken. I
am 27 years old from a royal family of Ken sarowiwa Kings hence I bear the
tittle "PRINCESS" I am single and the only duagther of my parents.my father
was a royal king of OGONI a prominent community in Rivers state Nigeria who
was killed through hanging by the order of late Gen sani Abacha because of
his community inheritance which are ( crude oil) that the F.G.N has taken
possession of it.

We are only two, I and my younger brother KEN SARO WIWA[jnr],after one year
death of my father, my mother died of High Blood preasure (HBP).Meanwhile, we
inherited some fortune in form of cash which I will reveal to you when we get
your response.Our old family friends have been very dishonest with us since the
death of our parents, they have duped us of virtually all cash in the banks
with different stories and reason. As such we decided to cut off relationship
from people around us because we find out that they have on motive to squander
what is left. We had to leave Nigeria to stay in neighbuoring cameroun republic
with the assistance of our family lawyer in Nigeria, we are here now for three
years and would like to move out to another continent.I am interested to enter
into strong relation with you as a friend and partner after i have gotten good
information about you on internet.To be frank, we need someone who is kind and
sincere that will assist us.

We are interested to invest and live in your country therefore, it will be our
pleasure if you can be of help to us by assisting us to handle the investment
and planing of our fortune we inherited, to enable us build a new home for
safekeeping of our lives.

Please let me receive your response urgently.My kindest compliments.

Yours Faithfully,
Princess B. Saro-Wiwa.
bsarowiwa@incamail.com OR b_sarowiwa@yahoo.com.au

------------------------------------------------------------
Tired of spam and email overload?
Get a FREE 6MB email account at http://www.incamail.com

The first thing a Bayesian filter must do is split the message into tokens and build a table of all the tokens it intends to use in the decision-making process. For our sample message, the table would be:

10.1.1.27	209.11.24.18		abacha		about
account		after			all		and
another		app2.incamail.com	are		around
assist		assistance		assisting	avenue
banks		bear			because		been
bit		blood			brother		bsarowiwa
build		buma			cameroun	can
cash		charset			communigate	community
compliments	contact			content-length	content-type
continent.i	country			crude		cut
dear		death			decided		died
different	directory		dishonest	duagther
duped		email			enable		enter
esmtp		f.g.n			faithfully	family
father		feel			find		for
form		fortune			frank		free
friend		friends			frist		from
gen		get			good		got
gotten		great			had		handle
hanging		has			have		hbp
helo		help			hence		here
high		his			home		http
inbox		incamail.com		information	inheritance
inherited	instinct		interested	international
internet.to	into			introduce	invest
investment	jnr			ken		killed
kind		kindest			king		kings
late		lawyer			leave		left
let		library			like		live
lives		may			meanwhile	mime-version
mother		motive			move		myself
name		need			neighbuoring	nene
new		nigeria			now		off
ogoni		oil			old		one
only		order			our		out
overload	parents			parents.my	partner
people		plain			planing		please
pleasure	possession		preasure	princess
princessbuma	pro			prominent	reason
receive		received		relation	relationship
republic	response		response.our	reveal
rivers		royal			safekeeping	sani
saro		saro-wiwa		sarowiwa	school
since		sincere			single		smtp
some		someone			spam		squander
state		stay			stories		strong
subject		such			sun		taken
tells		text			that		the
therefore	they			three		through
tired		tittle			two		unknown
urgent		urgently.my		us-ascii	venice.example.com
very		virtually		was		what
when		which			who		will
with		wiwa			would		write
www.incamail.com	x-priority		x-suffix	yahoo.com.au
year		years			you		younde
younger		your			yours
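The exact tokenization rules are not spelled out in this document, so the following Python sketch is only an approximation. It assumes that tokens are runs of letters and digits that may be joined internally by periods or hyphens (which would explain entries such as “urgently.my” and “content-length”), that tokens are lowercased, and that tokens shorter than three characters are dropped. The name `tokenize`, the regular expression, and the length threshold are all illustrative assumptions, and the sketch is not expected to reproduce every entry in the table above.

```python
import re

# Assumption: letters/digits, optionally joined by single '.' or '-' characters.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:[.-][a-z0-9]+)*")
MIN_TOKEN_LENGTH = 3  # assumption: very short tokens are ignored


def tokenize(message):
    """Return the sorted, de-duplicated tokens found in a message."""
    candidates = TOKEN_PATTERN.findall(message.lower())
    return sorted({t for t in candidates if len(t) >= MIN_TOKEN_LENGTH})


print(tokenize("Please let me receive your response urgently.My kindest compliments."))
```

Because the pattern only allows a period between two alphanumeric characters, the missing space in “urgently.My” produces a single token, while the period that ends the sentence is discarded.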

Once the Bayesian filter has the list of tokens in the message, it searches the spam and non-spam token databases for these tokens. These databases of tokens are created and updated whenever the Bayesian filter is “trained” on a new message.
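A rough picture of what such token databases could look like is sketched below in Python. The class name `TokenDatabase` and its methods are hypothetical and are not taken from any product. The sketch also assumes that a token’s frequency counts every occurrence of the token rather than the number of messages containing it, which seems consistent with the sample figures in this document, where a common token’s frequency can exceed the number of trained messages.

```python
from collections import Counter


class TokenDatabase:
    """Hypothetical in-memory spam/ham token databases."""

    def __init__(self):
        self.spam_tokens = Counter()  # token -> occurrences in spam
        self.ham_tokens = Counter()   # token -> occurrences in ham
        self.spam_messages = 0        # number of spam messages trained on
        self.ham_messages = 0         # number of ham messages trained on

    def train(self, tokens, is_spam):
        """Record every token occurrence from one classified message."""
        if is_spam:
            self.spam_tokens.update(tokens)
            self.spam_messages += 1
        else:
            self.ham_tokens.update(tokens)
            self.ham_messages += 1

    def lookup(self, token):
        """Return (spam frequency, ham frequency); unseen tokens give (0, 0)."""
        return self.spam_tokens[token], self.ham_tokens[token]


db = TokenDatabase()
db.train(["urgent", "assistance", "urgent"], is_spam=True)
db.train(["meeting", "urgent"], is_spam=False)
print(db.lookup("urgent"))
```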

If a token from the message is found in the databases, the Bayesian filter calculates the token’s spamicity based on the following variables:

  • The frequency of the token in spam messages that the filter has been trained on
  • The frequency of the token in ham messages that the filter has been trained on
  • The number of spam messages the filter has been trained on
  • The number of ham messages the filter has been trained on

The algorithm used to calculate a token’s spamicity from these pieces of information is as follows:

Ham probability = Token frequency in ham messages / Number of ham messages trained on

Spam probability = Token frequency in spam messages / Number of spam messages trained on

If either Ham probability or Spam probability is greater than 1.0, set it equal to 1.0.

Spamicity = Spam probability / (Ham probability + Spam probability)

If a token has occurred fewer than five times in total across the ham and spam messages, the token is assigned a default spamicity of 0.4. The following example and table use a set of sample token databases generated by a live mail feed on a test system at Process Software. The Bayesian filter was trained on 19,977 spam messages and 5,141 ham messages.

As an example of this algorithm, take the token “after” from the sample spam message. It occurs 1,134 times in the spam token database and 1,184 times in the ham token database, so:

Ham probability = 1184 / 5141 = 0.230305
Spam probability = 1134 / 19977 = 0.056765
Spamicity = 0.056765 / (0.056765 + 0.230305) = 0.197740

This tells us that there’s only a 19.8% chance that a message containing the word “after” is a spam message.
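The per-token calculation described above can be sketched in Python as follows. The function and constant names are assumptions made for illustration; the arithmetic follows the stated steps (divide each frequency by the number of trained messages, cap each probability at 1.0, and fall back to 0.4 for tokens seen fewer than five times). The sketch assumes both message counts are greater than zero.

```python
DEFAULT_SPAMICITY = 0.4  # value given in the text for rarely seen tokens
MIN_OCCURRENCES = 5      # threshold given in the text


def token_spamicity(spam_frequency, ham_frequency, spam_messages, ham_messages):
    """Return the spamicity of one token from its database frequencies."""
    if spam_frequency + ham_frequency < MIN_OCCURRENCES:
        return DEFAULT_SPAMICITY
    # Cap each probability at 1.0, since a token can occur several
    # times within a single message.
    ham_probability = min(1.0, ham_frequency / ham_messages)
    spam_probability = min(1.0, spam_frequency / spam_messages)
    return spam_probability / (ham_probability + spam_probability)


# The token "after", with the training totals quoted in the text.
print(f"{token_spamicity(1134, 1184, 19977, 5141):.6f}")
```

With the training totals quoted in the text, a very common token whose frequencies exceed both message counts has both probabilities capped at 1.0 and therefore receives a neutral spamicity of 0.5.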

Repeating this process for each of the tokens in our sample message, we get the following frequencies and spamicities:

Token               Spam Frequency   Ham Frequency   Spamicity
-----               --------------   -------------   ---------
10.1.1.27           0                0               0.400000
209.11.24.18        0                0               0.400000
abacha              14               2               0.643038
about               3301             2578            0.247848
account             585              563             0.210984
after               1134             1184            0.197740
all                 9767             3759            0.400717
and                 32109            12353           0.500000
another             1305             784             0.299898
app2.incamail.com   0                0               0.400000
...
x-priority          11524            852             0.776826
x-suffix            0                0               0.400000
yahoo.com.au        0                0               0.400000
year                1096             421             0.401182
years               1397             503             0.416820
you                 40273            9606            0.500000
younde              0                0               0.400000
younger             250              4               0.941466
your                31926            4534            0.531370
yours               682              75              0.700611

Now that the filter has calculated the spamicity value for each token in the message, it needs to choose 15 tokens that will be plugged into the Bayesian formula to calculate the message’s overall spamicity. Using a subset of the tokens in the message enhances the Bayesian filter’s performance, especially when dealing with large messages.

Early implementations of Bayesian filters chose the 15 tokens that had the most extreme values (i.e., the 15 tokens whose values were furthest from the neutral value of 0.5). Spammers have started including words that they’re fairly sure will have a low spamicity, such as “congresswoman” and “umbrella”, in their messages in an attempt to circumvent this system. As a result, the Bayesian filter included in PreciseMail uses a sampling algorithm based on standard deviation to choose the 15 tokens fed to the Bayesian formula.
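The early “most extreme values” strategy is simple enough to sketch in Python. Note that this illustrates only that older approach; the sampling algorithm used by PreciseMail is not described in this document, so no attempt is made to reproduce it. The function name `select_most_extreme` is an assumption.

```python
def select_most_extreme(spamicities, count=15):
    """Return the `count` (token, spamicity) pairs furthest from 0.5.

    `spamicities` maps each token to its spamicity value.
    """
    ranked = sorted(
        spamicities.items(),
        key=lambda item: abs(item[1] - 0.5),  # distance from neutral
        reverse=True,
    )
    return ranked[:count]


# A few values taken from the sample message's spamicity table.
sample = {
    "after": 0.197740,
    "and": 0.500000,
    "younger": 0.941466,
    "your": 0.531370,
    "yours": 0.700611,
}
print(select_most_extreme(sample, count=3))
```

This also shows why the strategy is easy to game: a handful of padded words with spamicities near 0.0 would rank as highly as the genuinely spam-like tokens and crowd them out of the selection.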

For our sample message, the 15 tokens chosen by the Bayesian filter are:

Token		Spamicity
-----		---------
account		0.210984
after		0.197740
crude		0.990000
faithfully	0.990000
good		0.173185
inherited	0.010000
invest		0.836338
investment	0.845059
let		0.207959
overload	0.010000
prominent	0.990000
receive		0.862871
safekeeping	0.990000
sincere		0.990000
therefore	0.197946

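Once the Bayesian filter has selected 15 tokens, it plugs their spamicity values into Bayes’ formula, as shown below. (With 15 different values, this gets a little bit messy on paper.) For our sample message, the probability of the message being spam is:

$$P = \frac{N}{N + D}$$

where N is the product of the 15 spamicity values and D is the product of their complements:

$$N = 0.210984 \times 0.197740 \times 0.99 \times 0.99 \times 0.173185 \times 0.01 \times 0.836338 \times 0.845059 \times 0.207959 \times 0.01 \times 0.99 \times 0.862871 \times 0.99 \times 0.99 \times 0.197946$$

$$D = (1 - 0.210984)(1 - 0.197740)(1 - 0.99)(1 - 0.99)(1 - 0.173185)(1 - 0.01)(1 - 0.836338)(1 - 0.845059)(1 - 0.207959)(1 - 0.01)(1 - 0.99)(1 - 0.862871)(1 - 0.99)(1 - 0.99)(1 - 0.197946)$$

This equation simplifies to:

$$P \approx \frac{1.725 \times 10^{-8}}{1.725 \times 10^{-8} + 1.133 \times 10^{-13}}$$

Solving this equation yields a probability of 0.999993, or a 99.9993% chance that the message is spam. If this message were sent to an email server protected by PreciseMail, it would be quarantined, discarded, or tagged as spam based on the options chosen by the system administrator.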
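The final figure can be reproduced numerically from the 15 selected spamicity values. The Python sketch below is one possible way to do so and is not a description of how PreciseMail computes it: rather than multiplying 15 small numbers directly, it sums log-odds, which is algebraically equivalent when every value lies strictly between 0 and 1 and is less prone to underflow with long token lists. The names `combine_log_odds` and `SELECTED_SPAMICITIES` are assumptions.

```python
import math


def combine_log_odds(probabilities):
    """Combine probabilities by summing their log-odds.

    Equivalent to prod(p) / (prod(p) + prod(1 - p)) when every
    probability is strictly between 0 and 1.
    """
    log_odds = sum(math.log(p) - math.log(1.0 - p) for p in probabilities)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Spamicities of the 15 tokens chosen for the sample message.
SELECTED_SPAMICITIES = [
    0.210984, 0.197740, 0.990000, 0.990000, 0.173185,
    0.010000, 0.836338, 0.845059, 0.207959, 0.010000,
    0.990000, 0.862871, 0.990000, 0.990000, 0.197946,
]

print(f"{combine_log_odds(SELECTED_SPAMICITIES):.6f}")
```

Five of the selected tokens sit at 0.99 and only two at 0.01, so the spam-leaning evidence outweighs the ham-leaning evidence by several orders of magnitude, which is why the combined result lands so close to 1.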