Thursday, October 31, 2013

Bayesian Learning, The Naïve Bayes Model



Naïve Bayes is a simple probabilistic classifier based on applying Bayes’ theorem with the assumption of independence between features.

Explanation of Bayes’ rule:

Bayes’ rule states that p(h|e) = p(e|h) * p(h) / p(e)


The probability of a hypothesis or event (h) occurring can be predicted based on evidence (e) observed from prior events.

Some important terms are:

Prior probability – p(h), the probability of the hypothesis before the evidence is observed.

Posterior probability – p(h|e), the probability of the hypothesis after the evidence is observed.

Naïve Bayes can be used in areas such as document classification (spam filtering, website classification, etc.) and, more generally, for problems whose features can safely be assumed independent, or where establishing the dependencies would be too costly.

Let H be the event “fever” and E the evidence “sore throat”; then we have

P(fever | sore throat) = P(sore throat | fever) * P(fever) / P(sore throat)

P(sore throat | fever) is the probability that a person has a “sore throat” given (or during) a “fever”.

P(fever) is the prior probability. It can be obtained from statistical medical records, such as the number of people who had a fever when visiting the doctor this year.

P(sore throat) is the probability of the evidence occurring. It can likewise be obtained from statistical medical records, such as the number of people who had a sore throat when visiting the doctor this year.
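As a concrete sketch in Python, with counts that are made up purely for illustration (say, out of 1000 visits to the doctor, 100 had a fever, 150 had a sore throat, and 60 had both):

    # Hypothetical counts, for illustration only.
    visits = 1000        # total visits to the doctor
    fever = 100          # visits with a fever
    sore_throat = 150    # visits with a sore throat
    both = 60            # visits with both a fever and a sore throat

    p_fever = fever / visits              # prior P(fever)
    p_sore = sore_throat / visits         # evidence P(sore throat)
    p_sore_given_fever = both / fever     # likelihood P(sore throat | fever)

    # Bayes' rule: P(fever | sore throat) = P(sore throat | fever) * P(fever) / P(sore throat)
    p_fever_given_sore = p_sore_given_fever * p_fever / p_sore
    print(p_fever_given_sore)             # 0.6 * 0.1 / 0.15 = 0.4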

 
As one can observe from the above example, we can predict the outcome of an event by observing some evidence; the more evidence, the better the prediction. When we include more evidence to build our NB model, however, we can run into the problem of dependencies. For example, the evidence “excessive coughing” might itself be caused by “sore throat”, which would make the model complicated. We therefore assume that all pieces of evidence are “independent” of each other, hence “naïve”.

Bayes’ rule for multiple pieces of evidence, with the independence assumption, is

P(H | E1, E2, ..., En) = P(E1 | H) * P(E2 | H) * ... * P(En | H) * P(H) / P(E1, E2, ..., En)
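As a minimal sketch, this formula can be turned into a small function; the function name and the dictionary layout here are my own, and the denominator P(E1, ..., En) is obtained implicitly by normalizing the scores over all hypotheses:

    def naive_bayes_posterior(priors, likelihoods):
        """priors: {hypothesis: P(H)},
        likelihoods: {hypothesis: [P(E1 | H), ..., P(En | H)]}.
        Returns {hypothesis: P(H | E1, ..., En)}."""
        scores = {}
        for h, p_h in priors.items():
            score = p_h
            for p_e_given_h in likelihoods[h]:
                score *= p_e_given_h       # multiply the independent evidence terms
            scores[h] = score              # proportional to the posterior
        total = sum(scores.values())       # plays the role of P(E1, ..., En)
        return {h: s / total for h, s in scores.items()}

    # Reusing the hypothetical fever counts above: P(sore throat | no fever) = 90/900 = 0.1
    print(naive_bayes_posterior({"fever": 0.1, "no fever": 0.9},
                                {"fever": [0.6], "no fever": [0.1]}))
    # {'fever': 0.4, 'no fever': 0.6}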


Example 1: Let’s try to build an NB model, using the weather example from the book “Data Mining: Practical Machine Learning Tools and Techniques”.


Predict whether the team will “play” given the features “outlook”, “temperature”, “humidity”, and “windy”.

 


Let’s build a frequency table of the different evidence values per feature and classification outcome.
 


The above table tabulates all the data in one place to make comparison a little easier: for each feature value, such as outlook = sunny, it shows the frequency of “yes” versus “no” classifications. The bottom portion shows the relative frequencies expressed as fractions, such as p(outlook = sunny | play = yes) = 2/9 and p(outlook = sunny | play = no) = 3/5.
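As a minimal sketch, the same counts can be written down in Python; the numbers are the per-class counts of the book’s weather dataset (14 days: 9 “yes” and 5 “no”), and the helper reproduces the fractions quoted above:

    # Per-class counts from the book's weather dataset (14 days: 9 "yes", 5 "no").
    counts = {
        "yes": {"outlook": {"sunny": 2, "overcast": 4, "rainy": 3},
                "temperature": {"hot": 2, "mild": 4, "cool": 3},
                "humidity": {"high": 3, "normal": 6},
                "windy": {"false": 6, "true": 3}},
        "no":  {"outlook": {"sunny": 3, "overcast": 0, "rainy": 2},
                "temperature": {"hot": 2, "mild": 2, "cool": 1},
                "humidity": {"high": 4, "normal": 1},
                "windy": {"false": 2, "true": 3}},
    }
    totals = {"yes": 9, "no": 5}

    def likelihood(feature, value, label):
        """Relative frequency P(feature = value | play = label)."""
        return counts[label][feature][value] / totals[label]

    print(likelihood("outlook", "sunny", "yes"))   # 2/9 ~ 0.222
    print(likelihood("outlook", "sunny", "no"))    # 3/5 = 0.6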

Now that we have created the NB model via the table above, we can use it to predict the likelihood of the event “play” for different evidence values. For example,

P[yes | outlook = rainy, temperature = cool, humidity = high, windy = false] =

P[rainy | yes] * P[cool | yes] * P[high | yes] * P[false | yes] * P[yes] / P[rainy, cool, high, false] =

3/9 * 3/9 * 3/9 * 6/9 * 9/14 ≈ 0.01587 (we can ignore the evidence term P[rainy, cool, high, false], since it is the same for both classes)

Likewise the likelihood of:

P[no | outlook = rainy, temperature = cool, humidity = high, windy = false] = P[rainy | no] * P[cool | no] * P[high | no] * P[false | no] * P[no] = 2/5 * 1/5 * 4/5 * 2/5 * 5/14 ≈ 0.00914

Finally, we convert the above likelihoods into probabilities by normalizing them:

P[yes] = 0.01587 / (0.01587 + 0.00914) ≈ 0.6345, or about a 63.5% chance that the team plays given the evidence

P[no] = 0.00914 / (0.01587 + 0.00914) ≈ 0.3655, or about a 36.5% chance that the team does not play given the evidence
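As a quick sanity check, the same arithmetic can be reproduced in a few lines of Python, reading the fractions straight off the frequency table:

    # Likelihoods for outlook = rainy, temperature = cool, humidity = high, windy = false.
    like_yes = (3/9) * (3/9) * (3/9) * (6/9) * (9/14)   # ~ 0.01587
    like_no  = (2/5) * (1/5) * (4/5) * (2/5) * (5/14)   # ~ 0.00914

    total = like_yes + like_no      # stands in for P(evidence)
    print(like_yes / total)         # ~ 0.635 -> play = yes
    print(like_no / total)          # ~ 0.365 -> play = no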

How do we deal with zero frequencies in our data, such as P(overcast | play = no) = 0/5? To be on the safe side, the data miner should not state that a hypothesis or event could never occur unless there is real scientific evidence that supports a zero probability. To solve the zero-frequency problem we use a technique known as “Laplace estimation”, adding a small constant m to all counts.


 

The above explanation of Laplace estimation is from Haruechaiyasak, Choochart, “A Tutorial on Naive Bayes Classification.”
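As a minimal sketch of this technique, assuming the simple additive form in which m is added to every count (so the denominator grows by m times the number of possible values), here is add-one smoothing (m = 1) applied to the outlook counts of the “no” class:

    # Outlook counts for the "no" class from the weather data: sunny 3, overcast 0, rainy 2.
    outlook_no = {"sunny": 3, "overcast": 0, "rainy": 2}
    total_no = sum(outlook_no.values())        # 5

    def laplace(count, total, num_values, m=1):
        """Additive (Laplace) smoothing: add a constant m to every count."""
        return (count + m) / (total + m * num_values)

    # Without smoothing P(overcast | no) = 0/5 = 0; with m = 1 it becomes 1/8.
    print(laplace(outlook_no["overcast"], total_no, len(outlook_no)))   # 0.125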

In the next article we will discuss how to build a Naïve Bayes model in the WEKA Explorer.

References


2. Haruechaiyasak, Choochart. “A Tutorial on Naive Bayes Classification.” N.p., 16 Aug. 2008. Web. 31 Oct. 2013.

3. Jurafsky, Dan. “Text Classification and Naïve Bayes.” Stanford University. Web. 31 Oct. 2013.

 
