Wednesday, October 9, 2013

Classification, Numeric Prediction, Entropy & Creating Decision Trees


There are generally four different kinds of learning in data mining problems: classification learning, clustering, numeric prediction, and association learning.

Classification learning refers to learning methods that predict categorical classes (which are discrete or nominal) based on a prior set of classified examples (the training set).

Classification is referred to as supervised learning, because a set of known data-and-class pairs (the training set) is provided in order to derive a model.

An example of classification is:

{Employment status, Salary, credit history, criminal history, debt/asset ratio, age} -> loan approval {YES/NO}

Clustering – data that are related or belong together are grouped into related classes.

An example of clustering is shown in the diagram below:

Diagram from (Classification and Prediction, A. Bellaachia)
 
 
Numeric prediction – a (continuous) numeric value is predicted via a mathematical model, such as linear regression y = mx + b, where y is the predicted numeric value and mx + b is the model.

An example of a polynomial model is below:
(Diagram created by Aryan Naim using Octave, from the UCI Machine Learning data repository)
Other terms that require formal definitions are Concept Description, Instances, and Attributes.
A Concept is the actual learning process, the algorithm or model used.
A Concept description is the outcome of the learning process, basically the classifications, predictions, and relationships that have been discovered using the Concept.
The input data to any model, i.e. the data that is supposed to be classified, is known as Instances.
Each instance has Attributes, much like fields: if instances are the rows of a table that represents the data, then attributes are the individual cells. For example, a patient record for a particular visit to the doctor is an instance, and its attributes can be blood pressure, weight, and symptoms.
 
Another example of attributes:
If an instance is {Employment status, Salary, credit history, criminal history, debt/asset ratio, age}, then the attributes are Employment status, Salary, credit history, criminal history, etc.
Attributes can be categorized as nominal, ordinal, interval, and ratio; these are known as levels of measurement. The methods of analysis used to derive knowledge from data depend on which level of measurement the data set in question belongs to.
Nominal attributes – the weakest level of measurement; they categorize data but do not rank it.
For example, nominal values in terms of weather can be windy, cloudy, sunny, humid, etc. These are definite categories, but they cannot be quantified, sorted, or compared.
Ordinal attributes – a stronger level of measurement than nominal. Ordinal data is sorted or ranked based on some criterion.
For example, Moody’s sovereign credit ratings are ordinal attributes, such as Aaa, Aa, A, Baa, Ba, Caa, Ca, and C, from highest quality debt to lowest quality debt.
Interval attributes – provide both ranking and the ability to measure the difference between two values.
For example, interval attributes such as temperature can be added or subtracted.
Ratio attributes – the strongest level of measurement compared to all three mentioned above.
A ratio attribute has interval measurement as well as a true zero point of origin, so one can meaningfully compare, add, and subtract amounts within the scale.
An example of a ratio attribute is the liquidity ratio = assets / liabilities.
In this article we will assume that each instance belongs to a single class. There are more complex cases where an instance may belong to several classes; these are known as multi-labeled instances.
Classification and Numeric Prediction issues
· Data preparation (reduce noise, handle missing values, handle extreme values, remove irrelevant values, quantize continuous numeric values)
· Compare the accuracy of the model to other possible models
· Time it takes to use the model
· Robustness (how well does the model handle missing values and outliers in real-world instances?)
· Compactness of the model – is this the most efficient model, or are there simpler models with the same success rate?
Steps in Learning
Classification generally has 3 steps (a short code sketch of all three appears after this list):
1) Preprocess & prepare the data (reduce noise, handle missing values, handle extreme values, remove irrelevant values, quantize continuous numeric values).
2) Model construction: generate a model (a set of rules) based on the training set.
A model can be a set of unordered classification rules, a decision tree, a mathematical model, a Bayesian belief network, a neural network, a support vector machine, or even a hybrid model such as a decision tree and a mathematical model used together, which is a regression tree.
3) Test the model: feed the test data set into the model, gather all outputs, and measure the success rate.
 
Classification Learning using Decision Tree Induction (references Croce Danilo, Roberto Basili)
Let’s explain, step by step and by example, one way of constructing a decision tree.
We want to decide whether to approve a loan for a list of clients who have applied for one. Below is a table of 15 applicants.


Now we would like to construct a decision tree. A tree has a root (the initial attribute to make a decision, or split, on), internal nodes (further decision points), and leaves (the final class assignments).
How does an algorithm pick which attribute to use as the root, and later which attributes to use at the internal nodes?
We have to use “Information Gain” methods such as entropy equations.
1) The first step in constructing a decision tree is picking which of the attributes {age, has_job, own_house, credit_rating} should be the root of the tree.
Let’s take a look at the entropy of the entire data set first, before trying to find the root.
In terms of data science “Entropy is a measure of uncertainty associated with a random variable.”
Given a set of examples D, it is possible to compute the original entropy of the entire dataset as:

Entropy(D) = − Σ (over every class c in C) Pr(c) * log2 Pr(c)

where C is the set of desired classes and Pr(c) is the fraction of instances in D that belong to class c.
For example, in our loan data set D:
· there are a total of 15 instances
· the loan-approval column, titled loan, is either YES or NO
· we will call YES positive, therefore Pr(positive) = 9/15 = 0.6
· we will call NO negative, therefore Pr(negative) = 6/15 = 0.4
Entropy (loan data set D) = -0.6 * log2 0.6 – 0.4 * log2 0.4 = 0.971
Now that we know the entropy of the entire data set, we can use this information later to decide on the root of the tree and then on how to construct the rest of it.
2) Second, let’s look at the entropy of an individual attribute Ai.
If we make attribute Ai, with v possible values, the root of the current tree, this partitions D into v subsets D1, D2, ..., Dv. The expected entropy if Ai is used as the current root is:

HAi[D] = Σ (j = 1..v) (|Dj| / |D|) * Entropy(Dj)

In our data set we have the attributes {age, has_job, own_house, credit_rating}, and we need to calculate this expected entropy for each of them.

Let’s start with the attribute “age”. Age has three possible values {young, middle, old}, so we compute the entropy for each value.

For Age = young and the 2 possible class values = {YES, NO}, the probabilities are:

Pr(NO | young) = 3/5 = 0.6, Pr(YES | young) = 2/5 = 0.4 according to the data set, and the entropy for this value is:

Hyoung[D] = -0.6 log2 0.6 – 0.4 log2 0.4 = 0.971

For Age = middle and the 2 possible class values = {YES, NO}, the probabilities are:

Pr(NO | middle) = 2/5 = 0.4, Pr(YES | middle) = 3/5 = 0.6 (young and old already account for 3 + 1 of the 6 NO instances, so middle must contain the remaining 2)

Hmiddle[D] = -0.4 log2 0.4 – 0.6 log2 0.6 = 0.971

For Age = old and the 2 possible Class values = {YES, NO} , the probabilities are:

Pr(NO | old) = 1/5 = 0.20, Pr(YES | old) = 4/5 = 0.8

Hold [D] = -0.2 log2 0.2 – 0.8 log2 0.8 = 0.722

Finally, take the weighted average of the per-value entropies. In general each value is weighted by the fraction of instances it covers; here each age value covers exactly 5 of the 15 instances, so the weights are all 1/3:

Hage[D] = (Hyoung[D] + Hmiddle[D] + Hold[D]) / 3 = (0.971 + 0.971 + 0.722) / 3 = 0.888

0.888 bits is the expected entropy of the data set when it is split on the Age attribute.

Repeat this entropy calculation for all remaining attributes {has_job,own_house, credit_rating}

3) Now calculate the information gain for each attribute {age, has_job, own_house, credit_rating} by subtracting that attribute’s expected entropy from the entropy of the entire data set D from step 1, Entropy(D) = 0.971, and finally select the attribute with the maximum result.

Information gain formula:

Gain(D, Ai) = Entropy(D) – HAi[D]
 

Below is the calculated gain for each attribute {age,has_job,own_house, credit_rating} :

Gain (D, age) = 0.971 – 0.888 = 0.083

Gain (D, own_house) = 0.971 – 0.551 = 0.420

Gain (D, has_job) = 0.971 – 0.647 = 0.324

Gain (D, credit_rating) = 0.971 – 0.608 = 0.363

 

The maximum information gain is Gain (D, own_house) = 0.420, therefore the root of the tree will be the “own_house” attribute.

 

4) Continue applying steps 2 and 3 to the remaining attributes {age, has_job, credit_rating} within each branch of the tree, until no attribute is left or all instances in a branch belong to the same class.
 
In the next tutorial we will construct a decision tree using WEKA.  

References
 
1. A. Bellaachia, "Classification and Prediction", http://www.seas.gwu.edu/~bell/csci243/lectures/classification.pdf
2. Croce Danilo, Roberto Basili, "Decision tree algorithm short Weka tutorial"
 


