There are generally four kinds of learning in data mining problems: classification learning, clustering, numeric prediction, and association learning.
Classification learning – learning methods that predict categorical classes (which are either discrete or nominal) based on a prior set of classified examples (the training set).
Classification is referred to as supervised learning because a set of known data and class pairs (the training set) has been provided in order to derive a model.
An example of classification:
{Employment status, salary, credit history, criminal history, debt/asset ratio, age} -> loan approval {YES/NO}
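As a minimal sketch of what this looks like in code (the records, the numeric feature encoding, and the use of scikit-learn are my own illustrative assumptions, not part of the original example), a classifier can be derived from labeled loan applications and then used to classify a new one:

```python
# A minimal classification-learning sketch; the records and the numeric
# feature encoding below are invented for illustration.
from sklearn.tree import DecisionTreeClassifier

# Each row: [employed (0/1), salary in $1000s, good credit history (0/1),
#            criminal history (0/1), debt/asset ratio, age]
X_train = [
    [1, 85, 1, 0, 0.2, 40],
    [0, 10, 0, 1, 0.9, 22],
    [1, 60, 1, 0, 0.4, 35],
    [0, 15, 1, 0, 0.8, 28],
]
y_train = ["YES", "NO", "YES", "NO"]  # loan approval class labels

# Derive a model from the prior classified examples (the training set).
model = DecisionTreeClassifier(criterion="entropy").fit(X_train, y_train)

# Classify a previously unseen applicant.
print(model.predict([[1, 45, 1, 0, 0.5, 30]]))
```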
Clustering – data that are related or belong together are grouped into related classes.
An example of clustering is below:
[Clustering diagram from "Classification and Prediction," A. Bellaachia]
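Since the diagram is not reproduced here, a small sketch of the idea (using k-means, which is my own choice of algorithm, not the article's) looks like this:

```python
# A minimal clustering sketch: unlabeled points are grouped into related
# classes without any predefined categories. k-means is an assumed choice.
from sklearn.cluster import KMeans

points = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # one natural group
          [8.0, 8.2], [7.8, 8.1], [8.3, 7.9]]   # another natural group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)  # e.g. [0 0 0 1 1 1] -- two discovered clusters
```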
Numeric prediction – a (continuous) numeric value is predicted via a mathematical model, such as linear regression "y = mx + b", where y is the predicted numeric value and "mx + b" is the model.
An example of a polynomial model is below:
[Diagram created by Aryan Naim using Octave, from the UCI Machine Learning data repository]
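As a small sketch of numeric prediction (the data points are invented; the article's actual plot used Octave and a UCI dataset that is not reproduced here):

```python
# Fit the linear model y = m*x + b by least squares and predict a new value.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

m, b = np.polyfit(x, y, deg=1)  # deg=1 -> line; a higher deg fits a polynomial
print(f"model: y = {m:.2f}x + {b:.2f}")
print("prediction at x = 6:", m * 6 + b)  # the predicted continuous value
```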
Other terms that require formal definitions are concept, concept description, instances, and attributes.
A concept is the actual learning process, the algorithm or model used.
A concept description is the outcome of the learning process: the classifications, predictions, and relationships that have been discovered using the concept.
The input data to any model, i.e., the data that is supposed to be classified, is known as the instances.
Each instance has attributes, such as fields; if instances are rows of a table that represents the data, then attributes are the individual cells. For example, a patient record for a particular visit to the doctor is the instance, and the attributes can be blood pressure, weight, and symptoms.
Another example of attributes: if an instance is {Employment status, salary, credit history, criminal history, debt/asset ratio, age}, then the attributes are employment status, salary, credit history, criminal history, and so on.
Attributes can be categorized as nominal, ordinal, interval, and ratio; these categories are known as levels of measurement. The methods of analysis used to derive knowledge from data depend on which level of measurement the data set in question belongs to.
Nominal attributes – the weakest level of measurement: they categorize data but do not rank it.
For example, nominal values in terms of weather can be windy, cloudy, sunny, humid, etc. These are distinct categories, but they cannot be quantified, sorted, or compared.
Ordinal attributes – a stronger level of measurement than nominal. Ordinal data is sorted or ranked based on some criterion.
For example, Moody's sovereign credit ratings are ordinal attributes: Aaa, Aa, A, Baa, Ba, Caa, Ca, and C, from highest quality debt to lowest quality debt.
Interval attributes – provide both ranking and the ability to measure the difference between two values.
For example, interval attributes such as temperature can be added or subtracted.
Ratio attributes – the strongest level of measurement compared to the three mentioned above.
A ratio attribute has interval measurement as well as a true zero point of origin, so with ratio attributes one can meaningfully compare, add, and subtract amounts within scales, and also take meaningful ratios.
An example of a ratio attribute is the liquidity ratio = assets / liabilities.
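A small illustration of which operations are meaningful at each level (the values below are my own; only the operations matter):

```python
# Nominal: categories support equality checks only.
weather = ["sunny", "windy", "cloudy"]
print(weather[0] == weather[1])                    # False; no order exists

# Ordinal: ranking is defined, but distances between ranks are not.
rank = {"C": 0, "Ca": 1, "Caa": 2, "Ba": 3, "Baa": 4, "A": 5, "Aa": 6, "Aaa": 7}
print(sorted(["Ba", "Aaa", "Caa"], key=rank.get))  # lowest to highest quality

# Interval: differences are meaningful, but there is no true zero,
# so 30 degrees C is not "1.5x as hot" as 20 degrees C.
temps_c = [20.0, 30.0]
print(temps_c[1] - temps_c[0], "degrees warmer")

# Ratio: a true zero exists, so ratios are meaningful.
assets, liabilities = 500_000, 250_000
print("liquidity ratio:", assets / liabilities)    # 2.0
```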
In this article we will assume that each instance belongs to a single classification. There are more complex cases where an instance may belong to several classes; these instances are known as multi-labeled instances.
Classification and Numeric Prediction Issues
· Data preparation (reduce noise, handle missing values, handle extreme values, remove irrelevant values, quantize continuous numeric values)
· Compare the accuracy of the model to other possible models
· Time it takes to use the model
· Robustness (how does the model handle missing values and outliers in real-world instances?)
· Compactness of the model (is this the most efficient model, or are there simpler models with the same success rate?)
Steps in Learning
Classification generally has 3 steps:
1) Preprocess & prepare the data (reduce noise, handle missing values, handle extreme values, remove irrelevant values, quantize continuous numeric values).
2) Model construction: generate a model (a set of rules) based on the training set. A model can be a set of unordered classification rules, a decision tree, a mathematical model, a Bayesian belief network, a neural network, a support vector machine, or even a hybrid such as a decision tree and a mathematical model used together, which is a regression tree.
3) Test the model: feed the test data set into the model, gather all outputs, and measure the success rate. A minimal end-to-end sketch of these three steps follows.
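Here is a small sketch of the three steps (the dataset and the choice of model are my own assumptions for illustration; scikit-learn ships the classic iris data already cleaned, which stands in for step 1):

```python
# Build a model from a training set, then measure its success rate on a
# held-out test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                 # step 1: prepared data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)          # hold out a test set

model = DecisionTreeClassifier().fit(X_train, y_train)  # step 2: model construction

predictions = model.predict(X_test)               # step 3: test the model
print("success rate:", accuracy_score(y_test, predictions))
```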
Classification Learning Using Decision Tree Induction (reference: Croce Danilo, Roberto Basili)
Let's explain, step by step and by example, one way of constructing a decision tree.
We want to decide whether to approve a loan for a list of clients that have applied for one. Below is a table of 15 people.
How does an algorithm pick which attribute to use as the root, and later which attributes to use for the nodes below it?
We have to use "information gain" methods based on entropy equations.
1) The first step in constructing a decision tree is picking which of the attributes {age, has_job, own_house, credit_rating} should be the root of the tree.
Let's take a look at the entropy of the entire data set first, before trying to find the root.
In terms of data science, "Entropy is a measure of uncertainty associated with a random variable."
Given a set of examples D, it is possible to compute the original entropy of the entire dataset as:
Entropy(D) = - Σ (c in C) Pr(c) * log2 Pr(c)
where C is the set of desired classes and Pr(c) is the fraction of instances in D belonging to class c.
For example, in our loan data set D:
· there are a total of 15 instances
· the loan-approval column, titled loan, is either YES or NO
· we will call YES positive, therefore Pr(positive) = 9/15 = 0.6
· we will call NO negative, therefore Pr(negative) = 6/15 = 0.4
Entropy(loan data set D) = -0.6 * log2 0.6 - 0.4 * log2 0.4 = 0.971
Now that we know the entropy of the entire data set, we can use this value later, first to decide the root of the tree and then to construct the rest of it. The short snippet below reproduces this number.
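A short snippet (my own sketch, not from the referenced slides) that reproduces this value:

```python
# Entropy of the loan data set D: 9 YES and 6 NO out of 15 instances.
from math import log2

def entropy(counts):
    """Entropy of a class distribution given raw counts per class."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([9, 6]))  # 0.9709..., matching Entropy(D) = 0.971
```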
2) Second, let's look at the expected entropy of an individual attribute Ai.
If we make attribute Ai, with v possible values, the root of the current tree, this will partition D into v subsets D1, D2, ..., Dv. The expected entropy if Ai is used as the current root is:
H_Ai(D) = Σ (j = 1..v) (|Dj| / |D|) * Entropy(Dj)
In our data set we have the attributes {age, has_job, own_house, credit_rating}; we need to calculate the expected entropy for each of them.
Let's start with the attribute "age". Age has three possible values {young, middle, old}, so we proceed value by value.
For age = young and the 2 possible class values {YES, NO}, the probabilities according to the data set are:
Pr(NO | young) = 3/5 = 0.6, Pr(YES | young) = 2/5 = 0.4
and the entropy for this value is:
H_young(D) = -0.6 * log2 0.6 - 0.4 * log2 0.4 = 0.971
For age = middle and the 2 possible class values {YES, NO}, the probabilities are:
Pr(NO | middle) = 2/5 = 0.4, Pr(YES | middle) = 3/5 = 0.6
(the three age groups must sum to the 9 YES and 6 NO instances in D)
H_middle(D) = -0.4 * log2 0.4 - 0.6 * log2 0.6 = 0.971
For age = old and the 2 possible class values {YES, NO}, the probabilities are:
Pr(NO | old) = 1/5 = 0.2, Pr(YES | old) = 4/5 = 0.8
H_old(D) = -0.2 * log2 0.2 - 0.8 * log2 0.8 = 0.722
Now weight each value's entropy by the fraction of instances that value covers. Each age value covers 5 of the 15 instances, so each weight is 5/15 = 1/3:
H_age(D) = (1/3) * 0.971 + (1/3) * 0.971 + (1/3) * 0.722 = 0.888
0.888 is the expected entropy of the age attribute over the entire data set.
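The same calculation in code (a sketch using the per-value class counts stated above):

```python
# Expected entropy of the "age" attribute. Each age value covers 5 of the
# 15 instances, so every partition's weight |Dj|/|D| is 1/3.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

partitions = {            # age value -> (YES count, NO count)
    "young":  (2, 3),
    "middle": (3, 2),
    "old":    (4, 1),
}

total = sum(sum(counts) for counts in partitions.values())   # 15 instances
h_age = sum(sum(c) / total * entropy(c) for c in partitions.values())
print(round(h_age, 3))   # 0.888, matching H_age(D) above
```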
Repeat this entropy calculation for all the remaining attributes {has_job, own_house, credit_rating}.
3) Now calculate the information gain for each attribute {age, has_job, own_house, credit_rating} by subtracting each attribute's expected entropy from the total entropy of the data set D from step 1, Entropy(D) = 0.971, and finally select the attribute with the maximum result.
Information gain formula:
Gain(D, Ai) = Entropy(D) - H_Ai(D)
Below are the calculated gains for each attribute {age, has_job, own_house, credit_rating}:
Gain(D, age) = 0.971 - 0.888 = 0.083
Gain(D, own_house) = 0.971 - 0.551 = 0.420
Gain(D, has_job) = 0.971 - 0.647 = 0.324
Gain(D, credit_rating) = 0.971 - 0.608 = 0.363
The maximum information gain is Gain(D, own_house) = 0.420, therefore the root of the tree will be the "own_house" attribute. The snippet below reproduces all four gains.
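The gains can be reproduced end to end. The article's 15-row table is not shown above, so the data set in this sketch is an assumption: it is the loan data set from Bing Liu's "Web Data Mining" (which lecture notes on this example commonly follow), and it yields exactly the entropy and gain values computed in the text:

```python
# Information gain for every attribute of the (assumed) loan data set.
# rows: (age, has_job, own_house, credit_rating, class)
from math import log2

D = [
    ("young",  False, False, "fair",      "NO"),
    ("young",  False, False, "good",      "NO"),
    ("young",  True,  False, "good",      "YES"),
    ("young",  True,  True,  "fair",      "YES"),
    ("young",  False, False, "fair",      "NO"),
    ("middle", False, False, "fair",      "NO"),
    ("middle", False, False, "good",      "NO"),
    ("middle", True,  True,  "good",      "YES"),
    ("middle", False, True,  "excellent", "YES"),
    ("middle", False, True,  "excellent", "YES"),
    ("old",    False, True,  "excellent", "YES"),
    ("old",    False, True,  "good",      "YES"),
    ("old",    True,  False, "good",      "YES"),
    ("old",    True,  False, "excellent", "YES"),
    ("old",    False, False, "fair",      "NO"),
]
ATTRS = {"age": 0, "has_job": 1, "own_house": 2, "credit_rating": 3}

def entropy(rows):
    counts = {}
    for row in rows:
        counts[row[-1]] = counts.get(row[-1], 0) + 1
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def gain(rows, attr):
    """Entropy(D) minus the expected entropy after splitting on attr."""
    i = ATTRS[attr]
    expected = 0.0
    for v in set(row[i] for row in rows):
        part = [r for r in rows if r[i] == v]
        expected += len(part) / len(rows) * entropy(part)
    return entropy(rows) - expected

for a in ATTRS:
    print(a, round(gain(D, a), 3))
# age 0.083, has_job 0.324, own_house 0.42, credit_rating 0.363
```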
4) Repeat steps 2 and 3 on each partition with the remaining attributes {age, has_job, credit_rating} in order to construct the rest of the tree, until no attribute is left or a partition is pure. A compact recursive sketch follows.
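A compact recursive sketch of step 4 (reusing entropy(), gain(), ATTRS, and D from the previous snippet): split on the best attribute, then recurse on each partition with the remaining attributes:

```python
def build_tree(rows, attrs):
    """Recursively build a decision tree as nested dicts."""
    classes = set(row[-1] for row in rows)
    if len(classes) == 1:                 # pure partition: stop with a leaf
        return classes.pop()
    if not attrs:                         # no attribute left: majority class
        return max(classes, key=lambda c: sum(r[-1] == c for r in rows))
    best = max(attrs, key=lambda a: gain(rows, a))   # highest information gain
    i = ATTRS[best]
    remaining = [a for a in attrs if a != best]
    return {best: {v: build_tree([r for r in rows if r[i] == v], remaining)
                   for v in set(row[i] for row in rows)}}

print(build_tree(D, list(ATTRS)))
# {'own_house': {True: 'YES', False: {'has_job': {True: 'YES', False: 'NO'}}}}
```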
In the next tutorial we will construct a decision tree using WEKA.
References
1. A. Bellaachia, "Classification and Prediction," http://www.seas.gwu.edu/~bell/csci243/lectures/classification.pdf
2. Croce Danilo and Roberto Basili, "Decision tree algorithm short Weka tutorial"