1) Go to the WEKA GUI Chooser and click the “Explorer” button.
2) Go to the UCI Machine Learning Repository and download the Credit Approval data set (http://archive.ics.uci.edu/ml/datasets/Credit+Approval). Click the Data Folder hyperlink to get the data set, and the Data Set Description to get an overview of the attributes.
3) According to the data description file (http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.names), some important information about the data is given below:
Number of Instances: 690
Number of Attributes: 15 + class attribute
Attribute Information:
A1: b, a.
A2: continuous.
A3: continuous.
A4: u, y, l, t.
A5: g, p, gg.
A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.
A7: v, h, bb, j, n, z, dd, ff, o.
A8: continuous.
A9: t, f.
A10: t, f.
A11: continuous.
A12: t, f.
A13: g, p, s.
A14: continuous.
A15: continuous.
A16: +, - (class attribute)
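The attribute list above can also be written down as a small schema and used to sanity-check rows of the data file before loading them into WEKA. This is only a sketch: the names SCHEMA, MISSING and check_row are made up for illustration, and it assumes that missing values in the file are marked with "?".

```python
# Attribute schema as listed in the data description.
# None marks a continuous attribute; a set lists the allowed nominal values.
SCHEMA = {
    "A1": {"b", "a"},
    "A2": None,
    "A3": None,
    "A4": {"u", "y", "l", "t"},
    "A5": {"g", "p", "gg"},
    "A6": {"c", "d", "cc", "i", "j", "k", "m", "r", "q", "w", "x", "e", "aa", "ff"},
    "A7": {"v", "h", "bb", "j", "n", "z", "dd", "ff", "o"},
    "A8": None,
    "A9": {"t", "f"},
    "A10": {"t", "f"},
    "A11": None,
    "A12": {"t", "f"},
    "A13": {"g", "p", "s"},
    "A14": None,
    "A15": None,
    "A16": {"+", "-"},
}

MISSING = "?"  # assumption: missing values are marked with "?"


def check_row(line):
    """Return a list of problems found in one comma-separated data row."""
    values = line.strip().split(",")
    if len(values) != len(SCHEMA):
        return [f"expected {len(SCHEMA)} fields, got {len(values)}"]
    problems = []
    for (name, allowed), value in zip(SCHEMA.items(), values):
        if value == MISSING:
            continue  # missing values are accepted for any attribute
        if allowed is None:
            try:
                float(value)
            except ValueError:
                problems.append(f"{name}: {value!r} is not numeric")
        elif value not in allowed:
            problems.append(f"{name}: {value!r} is not one of {sorted(allowed)}")
    return problems


# First data row of crx.data
row = "b,30.83,0,u,g,w,v,1.25,t,t,01,f,g,00202,0,+"
print(check_row(row))  # prints []
```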
4) Download the crx.data file (http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data), which contains the credit-related data that we will use to build our model.
Sub step a)
In your browser, save the “crx.data” file to a desired location. Then open it in “Notepad++” or Windows Notepad and use the “Save as” option to change the extension of the file to “.csv”.
Sub step b)
Open crx.csv and add the header row “a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13,a14,a15,a16” as the first line of the file, following the attribute list in crx.names. The real attribute names were removed to protect the confidentiality of the credit approval data. The beginning of the file should then look like this:
a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13,a14,a15,a16
b,30.83,0,u,g,w,v,1.25,t,t,01,f,g,00202,0,+
a,58.67,4.46,u,g,q,h,3.04,t,t,06,f,g,00043,560,+
a,24.50,0.5,u,g,q,h,1.5,t,f,0,f,g,00280,824,+
b,27.83,1.54,u,g,w,v,3.75,t,t,05,t,g,00100,3,+
Etc…
Save again.
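If you prefer not to edit the file by hand, sub steps a) and b) can be done with a few lines of Python. This is a minimal sketch, not part of WEKA: the function name add_header is made up, and the demo runs on a throwaway copy of the first two rows shown above instead of the real download.

```python
import tempfile
from pathlib import Path

HEADER = "a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13,a14,a15,a16"


def add_header(src, dst, header=HEADER):
    """Copy src to dst with the header row as the first line."""
    data = Path(src).read_text()
    # Do not add the header twice if the file already starts with it.
    if not data.startswith(header):
        data = header + "\n" + data
    Path(dst).write_text(data)


# Demo on a temporary file holding the first two rows of crx.data.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "crx.data"
    dst = Path(tmp) / "crx.csv"
    src.write_text(
        "b,30.83,0,u,g,w,v,1.25,t,t,01,f,g,00202,0,+\n"
        "a,58.67,4.46,u,g,q,h,3.04,t,t,06,f,g,00043,560,+\n"
    )
    add_header(src, dst)
    first_line = dst.read_text().splitlines()[0]

print(first_line)
```

For the real file, the call would be add_header("crx.data", "crx.csv"), run in the folder where you saved the download.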
5) Go to the Explorer window and click the “Open file…” button. In the “Open” window, go to the “Files of type” drop-down list, change the type to “CSV data files”, and open crx.csv.
6) Next, select the Classify tab, click the “Choose” button, then go to the “trees” folder and select the J48 classifier (this is the open source version of the C4.5 decision tree construction algorithm).
Since we did not take the time to separate the data into “training” and “test” sets, use 10-fold cross-validation in order to get a reasonable measure of the accuracy of the algorithm.
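To see what 10-fold cross-validation does with our 690 instances, here is a minimal sketch in plain Python. It only shows how the data is divided: each instance is used for testing exactly once and for training nine times. The function name k_fold_indices and the seed are made up for illustration, and WEKA's own procedure may differ in details such as how the folds are balanced by class.

```python
import random


def k_fold_indices(n, k=10, seed=1):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_indices(690):
    print(len(train), len(test))  # 621 69 for each of the 10 folds
```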
Click the “Start” button on the lower left, and you will see WEKA generate the results.
According to WEKA, the algorithm has an accuracy of 86.087% in predicting the right class, and a 13.913% error rate, when predicting whether to approve a credit application or not.
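These percentages can be turned back into instance counts with a little arithmetic. The sketch below assumes that the percentages are computed over all 690 instances, which is the case when every instance is tested exactly once during cross-validation.

```python
TOTAL = 690            # number of instances in the data set
accuracy_pct = 86.087  # accuracy reported by WEKA

correct = round(TOTAL * accuracy_pct / 100)
incorrect = TOTAL - correct
print(correct, incorrect)  # 594 96

# Going the other way reproduces the reported percentages.
print(round(100 * correct / TOTAL, 3))    # 86.087
print(round(100 * incorrect / TOTAL, 3))  # 13.913
```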
Right-click on the result with the timestamp in the “Result list” text area and select “Visualize tree”.
You will see a small and very crowded tree. To make it viewable, drag with the mouse to expand the window, then right-click inside the window and select “Fit to screen”. See the visual representation of the decision tree below:
In conclusion, it seems the researchers who provided this data set to UCI might have been using real credit data in order to derive an accurate model. However, to protect the credit approval information, they removed the actual attribute names and replaced them with the labels a1-a16. This does not reduce our learning: we now have a model with about 86% accuracy, and this model can be programmed (using any programming language such as Java, C/C++, Python, etc.) into existing software and used without exposing any client or confidential information. It seems that the most important attribute in terms of "information gain" is "a9", and that is the reason "a9" is the root node of the tree.
References
1) Quinlan. (1987). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
2) http://facweb.cs.depaul.edu/mobasher/classes/ect584/weka/classify.html