Thursday, October 31, 2013

Bayesian Learning, The Naïve Bayes Model



Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem with the assumption of independence between features.

Explanation of Bayes' rule:

Bayes' rule states that p(h|e) = p(e|h) * p(h) / p(e)


The probability of a hypothesis or event (h) occurring can be predicted based on evidence (e) observed from prior events.

Some important terms are:

Prior probability – p(h), the probability of the event before the evidence is observed.

Posterior probability – p(h|e), the probability of the event after the evidence is observed.

Naïve Bayes can be used in areas such as document classification (spam filtering, website classification, etc.) and also for events whose features can safely be assumed independent, or where establishing the dependencies would be too costly.

Let H be the event "fever" and E the evidence "sore throat"; then we have

P(fever | sore throat) = P(sore throat | fever) * P(fever) / P(sore throat)

P(sore throat | fever) is the probability that a person has a "sore throat" given (or during) a "fever".

P(fever) is the prior probability. This can be obtained from statistical medical records, such as the number of people who had a fever when visiting the doctor this year.

P(sore throat) is the probability of the evidence occurring. This can likewise be obtained from medical records, such as the number of people who had a sore throat when visiting the doctor this year.
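To make the arithmetic concrete, here is a minimal Python sketch of Bayes' rule applied to this example. The three input probabilities are hypothetical placeholders, not real medical statistics.

# Bayes' rule for the fever / sore-throat example.
# All three input probabilities below are hypothetical placeholders;
# in practice they would come from medical records.
p_sore_throat_given_fever = 0.75   # P(sore throat | fever), hypothetical
p_fever = 0.10                     # prior P(fever), hypothetical
p_sore_throat = 0.20               # P(sore throat), hypothetical

# P(fever | sore throat) = P(sore throat | fever) * P(fever) / P(sore throat)
p_fever_given_sore_throat = p_sore_throat_given_fever * p_fever / p_sore_throat
print(p_fever_given_sore_throat)   # 0.375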

 
As one can observe from the above example, we can predict the outcome of an event by observing evidence; the more evidence, the better the prediction. When including more evidence in our NB model, we could run into the problem of dependencies. For example, including the evidence "excessive coughing", which might itself be due to "sore throat", can complicate the model. Therefore we assume that all pieces of evidence are "independent" of each other, hence "naïve".

Bayes' rule for multiple pieces of evidence, assuming independence:

P(H | E1, E2, ..., En) = P(E1 | H) x P(E2 | H) x ... x P(En | H) x P(H) / P(E1, E2, ..., En)


Example 1: Let's try to build an NB model, using the weather example from the book "Data Mining: Practical Machine Learning Tools and Techniques".


Predict whether the team will "play" given the features "outlook", "temperature", "humidity", and "windy".

 


Let's build a frequency table of the different evidence values per feature and classification outcome.
 


The above table tabulates all the data in one place in order to make comparison a little easier: each feature value, such as Outlook = sunny, has a frequency for the Yes versus the No classification. The bottom portion has the relative frequencies expressed as fractions, such as p(Outlook = sunny | Play = yes) = 2/9 and p(Outlook = sunny | Play = no) = 3/5.

Now that we have created the NB model via the table above, we can use it to predict the likelihood of the event "play" given different evidence values. For example,

P[yes | outlook = rainy, temperature = cool, humidity = high, windy = false] =

P[rainy | yes] * P[cool | yes] * P[high | yes] * P[false | yes] * P[yes] / P[rainy, cool, high, false] =

3/9 * 3/9 * 3/9 * 6/9 * 9/14 = 0.01587 (we can ignore the denominator P[evidence], since it is the same for both classes and cancels out during normalization)

Also, the likelihood of:

P[no | outlook = rainy, temperature = cool, humidity = high, windy = false] = 2/5 * 1/5 * 4/5 * 2/5 * 5/14 = 0.00914

Finally, we will convert the above likelihoods into probabilities by normalization:

P[yes] = 0.01587 / (0.01587 + 0.00914) = 0.6345, or about a 63.45% chance of play = yes given the evidence

P[no] = 0.00914 / (0.01587 + 0.00914) = 0.3655, or about a 36.55% chance of play = no given the evidence
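Here is a minimal Python sketch that reproduces the calculation above, using the relative frequencies from the weather data set (the same fractions tabulated in the frequency table).

from fractions import Fraction as F

# Conditional relative frequencies for the observed evidence values,
# taken from the frequency table above (9 "yes" and 5 "no" out of 14).
p_given_yes = {"rainy": F(3, 9), "cool": F(3, 9), "high": F(3, 9), "windy=false": F(6, 9)}
p_given_no  = {"rainy": F(2, 5), "cool": F(1, 5), "high": F(4, 5), "windy=false": F(2, 5)}

# Naive Bayes likelihood: product of the conditionals times the class prior.
like_yes, like_no = F(9, 14), F(5, 14)
for e in ["rainy", "cool", "high", "windy=false"]:
    like_yes *= p_given_yes[e]
    like_no  *= p_given_no[e]

# Normalize so the two values sum to 1; the shared denominator P(evidence) cancels.
total = like_yes + like_no
print(float(like_yes / total))   # ~0.6345 -> play = yes
print(float(like_no / total))    # ~0.3655 -> play = no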

How do we deal with zero frequencies in our data, such as P(overcast | play = no) = 0/5? To be on the safe side, the data miner should not state that a hypothesis or event could never occur, unless there is real scientific evidence that supports a zero probability. To solve the zero-frequency problem we use a technique known as "Laplace estimation": adding a constant "m" across all counts.


 

The above explanation of Laplace estimation is from Haruechaiyasak, Choochart, "A Tutorial on Naive Bayes Classification".
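Below is a minimal Python sketch of the idea (my own illustration, not taken from that tutorial), assuming the simple add-m form in which the constant m is added to every count; m = 1 gives add-one smoothing.

# Laplace (add-m) estimation: add a constant m to every count so that
# no conditional probability is ever exactly zero.
def laplace_estimate(count, class_total, num_values, m=1):
    # count       -- raw frequency of this attribute value within the class
    # class_total -- total number of instances of the class
    # num_values  -- number of distinct values the attribute can take
    return (count + m) / (class_total + m * num_values)

# Without smoothing, P(overcast | play = no) = 0/5 = 0, which wipes out the whole product.
# With add-one smoothing over the 3 outlook values (sunny, overcast, rainy):
print(laplace_estimate(0, 5, 3))   # (0 + 1) / (5 + 3) = 0.125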

In the next article we will discuss how to build a Naïve Bayes model in the WEKA Explorer.

References


1.      Witten, Ian H., and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques.

2.      Haruechaiyasak, Choochart. "A Tutorial on Naive Bayes Classification." 16 Aug 2008. Web. 31 Oct 2013.

3.      Jurafsky, Dan. "Text Classification and Naïve Bayes." Stanford University. Web. 31 Oct 2013.

 

Monday, October 14, 2013

Learning Decision Trees Using WEKA

1) Go to the WEKA GUI Chooser and select the Explorer button.
 

2)      Go to the UCI Machine Learning Repository and download the Credit Approval data set (http://archive.ics.uci.edu/ml/datasets/Credit+Approval). Click on the Data Folder hyperlink to get the data set and on the Data Set Description to get an overview of the attributes.


Some important information about the data is below:

Number of Instances: 690

Number of Attributes: 15 + class attribute

Attribute Information:
 

    A1: b, a.

    A2: continuous.

    A3: continuous.

    A4: u, y, l, t.

    A5: g, p, gg.

    A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.

    A7: v, h, bb, j, n, z, dd, ff, o.

    A8: continuous.

    A9: t, f.

    A10: t, f.

    A11: continuous.

    A12: t, f.

    A13: g, p, s.

    A14: continuous.

    A15: continuous.

    A16: +, -    (class attribute)

 

4)      You will need to download the crx.data file (http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data), which contains the credit-related data that we will use to build our model.
 
Sub step a)
In your browser, just save the "crx.data" file to a desired location. Then, using "Notepad++" or plain Windows Notepad, change the extension of the file to ".csv" by using the "Save As" option.
Sub step b)
Open crx.csv and paste the header line from crx.names, "a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13,a14,a15,a16", as the first row of the file (the real attribute names were removed to protect the confidentiality of the credit-approval data-gathering process).
Therefore the beginning of the file should look like this:

a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13,a14,a15,a16
b,30.83,0,u,g,w,v,1.25,t,t,01,f,g,00202,0,+
a,58.67,4.46,u,g,q,h,3.04,t,t,06,f,g,00043,560,+
a,24.50,0.5,u,g,q,h,1.5,t,f,0,f,g,00280,824,+
b,27.83,1.54,u,g,w,v,3.75,t,t,05,t,g,00100,3,+
Etc…
 
Save again.
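Alternatively, if you prefer to script sub-steps a) and b), the small Python sketch below prepends the header row and writes crx.csv. The file paths are assumptions; adjust them to wherever you saved crx.data.

# Prepend the header row to crx.data and save the result as crx.csv.
header = "a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13,a14,a15,a16"

with open("crx.data", "r") as src:          # path assumed; adjust as needed
    rows = src.read()

with open("crx.csv", "w") as dst:
    dst.write(header + "\n" + rows)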
 
5)     Go to the Explorer window and select the "Open file…" button. In the "Open" window, go to the "Files of type" drop-down list and change the type to "CSV data files".
 
6)      Next, select the Classify tab, then the "Choose" button, then go to the "trees" folder and select the J48 classifier (this is the open-source implementation of the C4.5 decision tree construction algorithm).
 
 
Since we did not take the time to separate the data into "training" and "test" sets, use 10-fold cross-validation in order to get a reasonable measure of the accuracy of the algorithm.
Click the "Start" button on the lower left, and you will see WEKA generate the results.
 
 
According to WEKA, the algorithm has an 86.087% accuracy in predicting the right class, and a 13.913% error rate, when predicting whether to approve a loan or not.
Right-click on the timestamped result in the "Result list" area and select "Visualize tree".
 
You will see a small and very crowded tree. In order to make it viewable, drag with the mouse to expand the window, then right-click inside the window and select "Fit to screen". See the visual representation of the decision tree below:


In conclusion, it seems the researchers who provided this data set to UCI might have been using real credit data in order to derive an accurate model; however, to protect the credit-approval information they removed the actual attribute names and replaced them with the labels a1-a16. This does not reduce our learning: we now have a model with about 86% accuracy, and this model can be programmed (using any programming language such as Java, C/C++, Python, etc.) into existing software and used without exposing any client or confidential information. It seems that the most important attribute in terms of "information gain" is "a9", and that is the reason "a9" is the root node.
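As a rough illustration of what "programming the model into existing software" could look like, here is a minimal Python sketch. Only the fact that a9 is the root is taken from the tree above; the branch outcomes and the inner split on a15 are hypothetical placeholders that you would replace by reading the real splits off the "Visualize tree" output.

# Hard-coding a learned decision tree as plain if/else rules.
# a9 is the root attribute (per the tree above); everything else below,
# including the a15 threshold and the class assigned to each branch,
# is a hypothetical placeholder.
def approve_credit(record):
    # record is a dict keyed by attribute name, e.g. {"a9": "t", "a15": 560, ...}
    if record["a9"] == "f":
        return "-"                   # hypothetical branch outcome
    if record["a15"] > 100:          # hypothetical inner split
        return "+"
    return "-"

print(approve_credit({"a9": "t", "a15": 560}))   # "+"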
References
1)      Quinlan. (1987). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
2)      http://facweb.cs.depaul.edu/mobasher/classes/ect584/weka/classify.html
 

Wednesday, October 9, 2013

Classification, Numeric Prediction, Entropy & Creating Decision Trees


There are generally four kinds of learning in data mining problems: classification learning, clustering, numeric prediction, and association learning.

Classification learning refers to learning methods that predict categorical classes (which are discrete or nominal) based on a prior set of classified examples (the training set).

Classification is referred to as supervised learning, because a set of understood data and class pairs (the training set) has been provided in order to derive a model.

An example of classification is:

{Employment status, Salary, credit history, criminal history, debt/asset ratio, age} -> loan approval {YES/NO}

Clustering – data that are related or belong together are grouped into related classes.

An example of clustering is below:

Diagram from (Classification and Prediction, A. Bellaachia)
 
 
Numeric prediction – a (continuous) numeric value is predicted via a mathematical model, such as linear regression "y = mx + b", where y is the predicted numeric value and "mx + b" is the model.

An example of a polynomial model is below:
(Diagram created by Aryan Naim , created using Octave , from UCI Machine Learning data repository)
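For instance, here is a minimal Python sketch of numeric prediction with a linear model y = mx + b; the sample points are hypothetical.

import numpy as np

# Fit a straight line y = m*x + b to a handful of hypothetical points
# and use it to predict a new numeric value.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

m, b = np.polyfit(x, y, 1)        # least-squares fit of a degree-1 polynomial
print(m, b)                       # slope and intercept of the fitted model
print(m * 6 + b)                  # predicted y for x = 6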
Other terms that require formal definitions are Concept Description, Instances, and Attributes.
A Concept is the actual learning process, the algorithm or model used.
A Concept description is the outcome of the learning process: basically the classifications, predictions, and relationships that have been discovered using the Concept.
The input data fed into a model, or the data that is supposed to be classified, is known as the Instances.
Each instance has Attributes, such as fields: if the instances are the rows of a table that represents the data, then the attributes are the individual cells. For example, a patient record for a particular visit to the doctor is an instance, and its attributes can be blood pressure, weight, and symptoms.
 
Another example of attributes:
If an instance is {Employment status, Salary, credit history, criminal history, debt/asset ratio, age}, then the attributes are Employment status, Salary, credit history, criminal history, etc.
Attributes can be categorized as nominal, ordinal, interval, and ratio; these categories are known as levels of measurement. The methods of analysis used to derive knowledge from data depend on which level of measurement the data set in question belongs to.
Nominal attributes – are the weakest level of measurement; they categorize data but do not rank it.
For example, nominal values in terms of weather can be windy, cloudy, sunny, humid, etc. These are definite categories, but they cannot be quantified, sorted, or compared.
Ordinal attributes – are a stronger level of measurement than nominal. Ordinal data is sorted or ranked based on some criterion.
For example, Moody's sovereign credit ratings are ordinal attributes, such as Aaa, Aa, A, Baa, Ba, Caa, Ca, and C, from the highest-quality debt to the lowest-quality debt.
Interval attributes – provide both ranking and the ability to measure the difference between two values.
For example, interval attributes such as temperature can be added or subtracted.
Ratio attributes – are the strongest type of measurement of the four mentioned above.
A ratio attribute has interval measurement as well as a true zero point of origin, and with ratio attributes one can meaningfully compare, add, and subtract amounts within scales.
An example of a ratio attribute is the liquidity ratio = assets / liabilities.
In this article we will assume that each instance belongs to a single classification; there are complex cases where an instance may belong to several classes, and these instances are known as multi-labeled instances.
Classification and Numeric Prediction issues
·         Data preparation (reduce noise, handle missing values, handle extreme values, remove irrelevant values, quantize continuous numeric values)
·         Compare the accuracy of the model to other possible models
·         Time it takes to use the model
·         Robustness (how does the model handle missing values and outliers in real-world instances?)
·         Compactness of the model – is this the most efficient model, or are there simpler models with the same success rate?
Steps in Learning
Classification generally has 3 steps:
1)      Preprocess & prepare the data (reduce noise, handle missing values, handle extreme values, remove irrelevant values, quantize continuous numeric values).
2)      Model construction: generate a model (a set of rules) based on the training set.
A model can be a set of unordered classification rules, a decision tree, a mathematical model, a Bayesian belief network, a neural network, a support vector machine, or even a hybrid model such as a decision tree and mathematical model used together, which is a regression tree.
3)      Test the model: input the test data set into the model, gather all outputs, and measure the success rate.
 
Classification Learning using Decision Tree Induction (references Croce Danilo, Roberto Basili)
Let's explain by example, step by step, one way of constructing a decision tree.
We want to decide whether to approve a loan for a list of clients who have applied for one. Below is a table of 15 people.


Now we would like to construct a decision tree. A tree has a root (the initial attribute to make a decision, or split, on) and internal decision nodes that end in leaves (the class outcomes).
How does an algorithm pick which attribute to use as the root, and later which attributes to split on further down?
We use "information gain" measures, which are based on entropy.
1)      The first step in constructing a decision tree is picking which of the attributes {age, has_job, own_house, credit_rating} should be the root of the tree.
Let's take a look at the entropy of the entire data set first, before trying to find the root.
In terms of data science “Entropy is a measure of uncertainty associated with a random variable.”
Given a set of examples D, it is possible to compute the original entropy of the entire dataset as:

Entropy(D) = −Σc∈C Pr(c) · log2 Pr(c)
Where C is the set of desired classes.
For example, in our loan data set D:
·         there are a total of 15 instances
·         the loan-approval column (the class) is either YES or NO
·         we will call YES positive, therefore Pr(positive) = 9/15 = 0.6
·         we will call NO negative, therefore Pr(negative) = 6/15 = 0.4
Entropy(loan data set D) = −0.6 * log2 0.6 − 0.4 * log2 0.4 = 0.971
Now that we know the entropy of the entire data set, we can use this value later to decide the root of the tree and how to construct the rest of it.
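As a quick check, here is a minimal Python sketch that reproduces the Entropy(D) = 0.971 value:

from math import log2

# Entropy(D) = -sum over classes c of Pr(c) * log2 Pr(c)
def entropy(class_probs):
    return -sum(p * log2(p) for p in class_probs if p > 0)

# 9 approved (YES) and 6 rejected (NO) out of 15 instances:
print(round(entropy([9 / 15, 6 / 15]), 3))   # 0.971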
2) Second, let's look at the expected entropy of an individual attribute Ai.
If we make attribute Ai, with v possible values, the root of the current tree, this will partition D into v subsets D1, D2, ..., Dv. The expected entropy if Ai is used as the current root is:

EntropyAi(D) = Σj=1..v (|Dj| / |D|) · Entropy(Dj)
In our data set we have the attributes {age, has_job, own_house, credit_rating}; we need to calculate this expected entropy for each of these attributes.

Let's start with the attribute "age". Age has three possible values {young, middle, old}, so we compute the entropy per value.

For Age = young and the 2 possible class values {YES, NO}, the probabilities are:

Pr(NO | young) = 3/5 = 0.6 and Pr(YES | young) = 2/5 = 0.4 according to the data set, and the entropy of this subset is

Hyoung[D] = −0.6 log2 0.6 − 0.4 log2 0.4 = 0.971

For Age = middle and the 2 possible class values {YES, NO}, the probabilities are:

Pr(NO | middle) = 3/5 = 0.6 and Pr(YES | middle) = 2/5 = 0.4

Hmiddle[D] = −0.6 log2 0.6 − 0.4 log2 0.4 = 0.971

For Age = old and the 2 possible class values {YES, NO}, the probabilities are:

Pr(NO | old) = 1/5 = 0.2 and Pr(YES | old) = 4/5 = 0.8

Hold[D] = −0.2 log2 0.2 − 0.8 log2 0.8 = 0.722

Since each age subset holds 5 of the 15 instances, each weight |Dj| / |D| is 1/3, so the expected entropy is the average of the three values: Hage[D] = (Hyoung[D] + Hmiddle[D] + Hold[D]) / 3 = (0.971 + 0.971 + 0.722) / 3 = 0.888

0.888 is the expected entropy of the Age attribute over the entire data set.

Repeat this entropy calculation for all remaining attributes {has_job,own_house, credit_rating}

3) Now calculate the information gain for each attribute {age, has_job, own_house, credit_rating} by subtracting each attribute's expected entropy from the entropy of the entire data set D from step 1, Entropy(D) = 0.971. Finally, select the attribute with the maximum result.

The information gain formula:

Gain(D, Ai) = Entropy(D) − EntropyAi(D)

Below is the calculated gain for each attribute {age,has_job,own_house, credit_rating} :

Gain(D, age) = 0.971 − 0.888 = 0.083

Gain(D, own_house) = 0.971 − 0.551 = 0.420

Gain(D, has_job) = 0.971 − 0.647 = 0.324

Gain(D, credit_rating) = 0.971 − 0.608 = 0.363

 

The maximum information gain is Gain(D, own_house) = 0.420; therefore the root of the tree will be the "own_house" attribute.

 

4)      Continue the same steps 2 and 3 on the remaining attributes {age, has_job, credit_rating} within each branch, in order to construct the tree until no attribute is left.
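Here is a minimal Python sketch that reproduces the entropy and information-gain numbers for the "age" attribute from steps 1-3 above (the per-value YES/NO counts are the ones used in the worked example):

from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# (YES, NO) class counts per value of "age", as in the worked example.
age_subsets = {"young": (2, 3), "middle": (2, 3), "old": (4, 1)}

total = sum(yes + no for yes, no in age_subsets.values())    # 15 instances
entropy_D = entropy([9 / 15, 6 / 15])                        # 0.971

# Expected entropy after splitting on age: weight each subset by |Dj| / |D|.
entropy_age = sum((yes + no) / total * entropy([yes / (yes + no), no / (yes + no)])
                  for yes, no in age_subsets.values())

print(round(entropy_age, 3))               # 0.888
print(round(entropy_D - entropy_age, 3))   # Gain(D, age) = 0.083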
 
In the next tutorial we will construct a decision tree using WEKA.  

References
 
1. http://www.seas.gwu.edu/~bell/csci243/lectures/classification.pdf
2. Croce Danilo and Roberto Basili. "Decision tree algorithm short Weka tutorial."