Monday, October 14, 2013

Learning Decision Trees Using WEKA

1)      Open the WEKA GUI Chooser and click the “Explorer” button.
 

2)      Go to the UCI Machine Learning Repository and find the Credit Approval data set (http://archive.ics.uci.edu/ml/datasets/Credit+Approval). Click the “Data Folder” link to get the data set, and the “Data Set Description” link for an overview of the attributes.


3)      Some important information about the data is given below:

Number of Instances: 690

Number of Attributes: 15 + class attribute

Attribute Information:
 

    A1: b, a.

    A2: continuous.

    A3: continuous.

    A4: u, y, l, t.

    A5: g, p, gg.

    A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.

    A7: v, h, bb, j, n, z, dd, ff, o.

    A8: continuous.

    A9: t, f.

    A10: t, f.

    A11: continuous.

    A12: t, f.

    A13: g, p, s.

    A14: continuous.

    A15: continuous.

    A16: +, -  (class attribute)

 

4)      Download the crx.data file (http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data), which contains the credit-related data we will use to build our model.
 
Sub step a)
  In your browser, save the “crx.data” file to a desired location. Then, using “Notepad++” or plain Windows Notepad, change the file’s extension to “.csv” with the “Save As” option.
Sub step b)
  Open crx.csv and paste the header row “a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13,a14,a15,a16” as the first line of the file. (The real attribute names were removed to protect the confidentiality of the credit approval process.)
The beginning of the file should then look like this:

a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13,a14,a15,a16
b,30.83,0,u,g,w,v,1.25,t,t,01,f,g,00202,0,+
a,58.67,4.46,u,g,q,h,3.04,t,t,06,f,g,00043,560,+
a,24.50,0.5,u,g,q,h,1.5,t,f,0,f,g,00280,824,+
b,27.83,1.54,u,g,w,v,3.75,t,t,05,t,g,00100,3,+
Etc…
 
Save again.
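The manual steps above can also be scripted. Here is a minimal Python sketch that prepends the header row to crx.data and writes crx.csv; in practice “crx.data” is the file downloaded from the UCI repository, but the sketch writes two sample rows first so it is self-contained:

```python
# Sketch: add the attribute header to crx.data and save it as crx.csv.
# In practice "crx.data" is the file downloaded from the UCI repository;
# here we create it with two sample rows so the script is self-contained.

HEADER = "a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13,a14,a15,a16"

sample_rows = [
    "b,30.83,0,u,g,w,v,1.25,t,t,01,f,g,00202,0,+",
    "a,58.67,4.46,u,g,q,h,3.04,t,t,06,f,g,00043,560,+",
]

with open("crx.data", "w") as f:
    f.write("\n".join(sample_rows) + "\n")

# Prepend the header row and write the CSV file that WEKA will load.
with open("crx.data") as src, open("crx.csv", "w") as dst:
    dst.write(HEADER + "\n")
    dst.write(src.read())
```

This produces the same crx.csv as the Notepad steps, with the header as the first row.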
 
5)      In the Explorer window, click the “Open file…” button. In the “Open” dialog, change the “Files of type” drop-down to “CSV data files” and select crx.csv.
 
6)      Next select the Classify tab, then the “Choose” button, then open the “trees” folder and select the J48 classifier (the open-source implementation of the C4.5 decision tree construction algorithm).
 
 
Since we did not take the time to split the data into separate “training” and “test” sets, use “10-fold cross-validation” to get a reasonable estimate of the classifier’s accuracy.
Click the “Start” button on the lower left, and WEKA generates the results.
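To see what 10-fold cross-validation does behind the scenes, here is a small conceptual Python sketch (not WEKA’s internal code): the 690 instance indices are split into 10 folds, and each fold takes one turn as the test set while the rest is used for training.

```python
# Conceptual sketch of 10-fold cross-validation on 690 instances.
n_instances = 690
k = 10

indices = list(range(n_instances))
# Split the indices into k nearly equal folds (690 / 10 = 69 each).
folds = [indices[i::k] for i in range(k)]

for test_fold in folds:
    test_set = set(test_fold)
    train = [i for i in indices if i not in test_set]
    # In each round, WEKA trains J48 on the 621 training instances and
    # evaluates on the 69 held-out instances; the reported accuracy is
    # the aggregate over all 10 rounds.
    assert len(train) + len(test_fold) == n_instances
```

Every instance is used for testing exactly once, which is why the resulting accuracy estimate is more trustworthy than evaluating on the training data itself.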
 
 
According to WEKA, the algorithm has 86.087% accuracy in predicting the right class, and a 13.913% error rate, when predicting whether to approve a loan or not.
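As a sanity check on those numbers: accuracy is simply the fraction of the 690 instances classified correctly, and 86.087% works out to 594 correct and 96 misclassified instances.

```python
# Reproduce WEKA's reported accuracy and error rate from raw counts.
total = 690                   # number of instances in the data set
correct = 594                 # correctly classified instances
incorrect = total - correct   # 96 misclassified instances

accuracy = 100.0 * correct / total      # fraction correct, as a percentage
error_rate = 100.0 * incorrect / total  # fraction wrong, as a percentage

print(round(accuracy, 3), round(error_rate, 3))  # prints: 86.087 13.913
```

The two percentages necessarily sum to 100%, since every instance is either classified correctly or incorrectly.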
Right-click the timestamped result in the “Result list” pane and select “Visualize tree”.
 
You will see a small but very crowded tree. To make it viewable, drag with the mouse to expand the window, then right-click inside the window and select “Fit to screen”. See the visual representation of the decision tree below:


In conclusion, it seems the researchers who provided this data set to UCI used real credit data in order to derive an accurate model; however, to protect the credit approval information they removed the actual attribute names and replaced them with the labels a1–a16. This does not reduce what we can learn: we now have a model with about 86% accuracy, and this model can be programmed (in any language such as Java, C/C++, Python, etc.) into existing software and used without exposing any client or confidential information. It also seems that the attribute with the highest “information gain” is “a9”, which is why “a9” was chosen as the root node.
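To illustrate how such a learned tree can be embedded in software, here is a hypothetical Python sketch. Only the root test on “a9” comes from the tree WEKA built above; the inner split on “a11” and the leaf labels are invented for illustration, so the real rules must be read off your own visualized tree.

```python
# Hypothetical encoding of a learned decision tree as plain code.
# Only the root test on "a9" mirrors the tree WEKA built; the inner
# branch and its threshold are invented for illustration.

def approve(record):
    """Return '+' (approve) or '-' (deny) for a dict of attribute values."""
    if record["a9"] == "t":
        # Illustrative sub-tree: a numeric split on a continuous attribute.
        if record["a11"] > 2.5:
            return "+"
        return "-"
    return "-"

print(approve({"a9": "t", "a11": 6.0}))  # prints: +
```

Each internal node of the tree becomes an if-test and each leaf becomes a return statement, so the whole model compiles down to a short, fast, dependency-free function.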
References
1)      Quinlan, J. R. (1987). Credit Approval Data Set. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
2)      http://facweb.cs.depaul.edu/mobasher/classes/ect584/weka/classify.html
 
