Thursday, October 3, 2013

Introduction to Data Mining & Machine Learning using WEKA Explorer and 1R Classifier


Data Mining & Machine Learning using WEKA Explorer
The 1R Classifier
The 1R (Holte’s 1R) classifier is a simple machine learning algorithm that works surprisingly well on standard data sets. Robert Holte developed the 1R learning algorithm and showed that it competes well with far more sophisticated algorithms. Holte did not advocate using 1R instead of other mainstream algorithms; rather, he used this simple classifier to demonstrate that many real-world data sets do not embody very complex relations and therefore do not warrant complicated learning algorithms. In order to use the 1R classifier, one must prepare the data to deal with such issues as continuous-valued data, missing values, and greedy decisions; these topics are beyond the scope of this article.
Below are the general steps of the 1R algorithm:
For each attribute a, form a rule as follows:
    For each value v from the domain of a,
        Select the set of instances where a has value v.
        Let c be the most frequent class in that set.
        Add the following clause to the rule for a:
            if a has value v then the class is c
 
    Finally, calculate the classification accuracy of this rule.
 
Note: Use the rule with the highest classification accuracy. The algorithm assumes that the attributes are discrete (Nevill-Manning, Holmes & Witten, 1995).
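The steps above can be sketched in plain Java. This is only a minimal illustration, not WEKA's OneR implementation: the class name OneRSketch, the Rule holder, and the small weather-style toy data are all hypothetical, every attribute is assumed to be discrete, and ties are broken by whichever value or attribute comes first.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class OneRSketch {

    /** A rule on a single attribute: value -> predicted class, plus its training hits. */
    static final class Rule {
        final int attribute;                    // column index the rule tests
        final Map<String, String> valueToClass; // "if a has value v then the class is c"
        final int correct;                      // training instances the rule gets right

        Rule(int attribute, Map<String, String> valueToClass, int correct) {
            this.attribute = attribute;
            this.valueToClass = valueToClass;
            this.correct = correct;
        }
    }

    /** Hypothetical toy data with columns: outlook, humidity, play (the class). */
    static List<String[]> toyData() {
        return Arrays.asList(
            new String[] {"sunny", "high", "no"},
            new String[] {"sunny", "high", "no"},
            new String[] {"overcast", "high", "yes"},
            new String[] {"rainy", "normal", "yes"},
            new String[] {"rainy", "high", "no"},
            new String[] {"overcast", "normal", "yes"},
            new String[] {"sunny", "normal", "yes"},
            new String[] {"rainy", "normal", "yes"});
    }

    /** Builds one rule per attribute and returns the one with the most correct instances. */
    static Rule train(List<String[]> rows, int classIndex) {
        Rule best = null;
        int numColumns = rows.get(0).length;
        for (int a = 0; a < numColumns; a++) {
            if (a == classIndex) {
                continue; // the class column is not a candidate attribute
            }
            // For each value v of attribute a, count how often each class occurs.
            Map<String, Map<String, Integer>> counts = new TreeMap<>();
            for (String[] row : rows) {
                counts.computeIfAbsent(row[a], k -> new TreeMap<>())
                      .merge(row[classIndex], 1, Integer::sum);
            }
            // Pick the most frequent class c for each value v.
            Map<String, String> valueToClass = new TreeMap<>();
            int correct = 0;
            for (Map.Entry<String, Map<String, Integer>> byValue : counts.entrySet()) {
                String bestClass = null;
                int bestCount = -1;
                for (Map.Entry<String, Integer> byClass : byValue.getValue().entrySet()) {
                    if (byClass.getValue() > bestCount) {
                        bestClass = byClass.getKey();
                        bestCount = byClass.getValue();
                    }
                }
                valueToClass.put(byValue.getKey(), bestClass);
                correct += bestCount;
            }
            // Keep the rule with the highest classification accuracy.
            if (best == null || correct > best.correct) {
                best = new Rule(a, valueToClass, correct);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String[]> rows = toyData();
        Rule rule = train(rows, 2);
        System.out.println("best attribute index: " + rule.attribute);
        System.out.println("rule: " + rule.valueToClass);
        System.out.println("correct: " + rule.correct + "/" + rows.size());
    }
}
```

On this toy data the humidity column (index 1) wins with 7 of 8 instances correct, against 6 of 8 for outlook.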
Let’s go over the required software & setup.
 
Installation & Setup

Software that you will need:


1)     WEKA, required
2)     JDK & JRE, required
3)     Any Java IDE (Eclipse, NetBeans, etc.), recommended
To download WEKA, please go to the following URL: http://www.cs.waikato.ac.nz/ml/weka/index.html
Setting up your environment
1)     Please download the latest JDK (Java Development Kit) & JRE (Java Runtime Environment), as WEKA is written in Java.
JDK (http://www.oracle.com/technetwork/java/javase/downloads/index.html)
2)     Make sure that your PATH & CLASSPATH system variables point to the bin directory of your JRE & JDK.

Setting up your environment in Windows via GUI.


Go to Start and type “cmd” in the “Search programs and files” box. The Windows Command Prompt will appear.

Checking Java

a) Check the value of CLASSPATH by typing the command “echo %CLASSPATH%”.

You should see all the paths that are in CLASSPATH, including your JDK’s bin folder.

b) To check whether the JDK is set up properly and is available globally for compiling your Java source, type the command “javac”.

You should see all of the “javac” command options printed out.
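As an optional cross-check from inside Java itself, here is a minimal sketch (the class name CheckJavaSetup is hypothetical) that prints the Java version and the class path the running JVM actually sees. If it compiles with “javac” and runs with “java”, both tools are reachable from the command prompt.

```java
public class CheckJavaSetup {

    /** Returns the Java version and class path as seen by the running JVM. */
    static String describe() {
        return "java.version=" + System.getProperty("java.version")
            + System.lineSeparator()
            + "java.class.path=" + System.getProperty("java.class.path");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```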

Open WEKA

1)     Let’s check that WEKA has been installed & runs. Go to Start, open the WEKA folder (such as Weka 3.6.9), and click on the WEKA icon. You should see a small GUI application open; click on the Explorer button.
The Explorer should open.

1)     A brief explanation of the WEKA Explorer panels.
·         Preprocess tab: where the data can be loaded. The panel lets one view, edit, and save the loaded data.
·         Classify tab: where the classification algorithms can be accessed.
·         Visualize tab: where the data can be viewed in graphical plots.

Importing our example data set into WEKA

2)     The Iris data set is probably one of the most cited data sets in data mining, machine learning, and statistics studies. I obtained a copy from http://repository.seasr.org/Datasets/UCI/arff/iris.arff.
 
Data Set (Snippet of the Iris Data set)
@RELATION iris
 
@ATTRIBUTE sepallength         REAL
@ATTRIBUTE sepalwidth          REAL
@ATTRIBUTE petallength         REAL
@ATTRIBUTE petalwidth          REAL
@ATTRIBUTE class       {Iris-setosa,Iris-versicolor,Iris-virginica}
 
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa

4.6,3.1,1.5,0.2,Iris-setosa
Convert CSV data to “.arff” file (Optional Section)
Most data in the world isn’t formatted for any particular tool, and data miners or programmers have to prepare the data before using one. The format WEKA uses is “arff”, which is not a common format outside of the data mining world. Fortunately, WEKA provides Java code that will convert a CSV (comma-separated values) file to an “arff” file.
If your data isn’t in “arff” format, follow this section & mini example; otherwise skip it and go to Opening The Data Set.
1)     Since the data set is small, I’m going to paste it into Microsoft Excel first.
 
(If the data set were very large, one would have to write a program to convert the data’s format to CSV.)
2)     I manually pasted the data set into Excel, then went to File, Save As, selected CSV (Comma delimited) from the “Save as type” option list, clicked Save, & clicked OK for all other warnings.
 

 
3)     Also use “Find & Replace” to change every “|” to a comma “,”, then save again.
 
 
 
As you start becoming a data miner, a data analyst, or anyone in a field that deals with data, you will realize that formatting the data and ensuring data integrity is probably as hard as the data mining itself and will consume a great deal of any “real world” project.
 

4)     Fortunately, the WEKA people were wise enough to provide us with a way to convert CSV to “arff” format files, although not a user-friendly one.


Go to your Java IDE, create a simple Java project, & paste the code below:

/*
Program usage:
command line argument 0 “args[0]” is the path of the CSV file you want to convert
Example: C:/Users/anaim/Desktop/Data Mining/OneR/Golf.csv
command line argument 1 “args[1]” is the path & name under which you would like to save the .arff file
Example: C:/Users/anaim/Desktop/Data Mining/OneR/Golf.arff
Remember when passing paths in Java one must change backslashes “\” to forward slashes “/”.
*/
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

import java.io.File;

public class CSV2Arff {
  /**
   * takes 2 arguments:
   * - CSV input file
   * - ARFF output file
   */
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.out.println("\nUsage: CSV2Arff <input.csv> <output.arff>\n");
      System.exit(1);
    }

    // load CSV
    CSVLoader loader = new CSVLoader();
    loader.setSource(new File(args[0]));
    Instances data = loader.getDataSet();

    // save ARFF
    ArffSaver saver = new ArffSaver();
    saver.setInstances(data);
    // I switched the order of the setDestination() call with setFile() as the original code had a bug
    saver.setDestination(new File(args[1]));
    saver.setFile(new File(args[1]));
    saver.writeBatch();
  }
}

Make sure to add the “weka” jar to your project’s build path in order to access the WEKA CSV-to-arff capabilities used in the Java code above; see the snapshot of this process below:
 
 

 


Now, if you’re using Eclipse, right-click on your project, click Run Configurations, and select the “Arguments” tab; see the snapshot below:



 
  Now run the program & generate the .arff file.
 
Opening The Data Set

Finally, go to the WEKA Explorer, select the “Open file…” option, & navigate to the path of your data set (for example, Golf.arff from the section above, or iris.arff for the rest of this walkthrough).

 
You should see a screen similar to the one below, with your attributes (sepallength, sepalwidth, petallength, petalwidth, and class) listed on the left.

 
Go to the “Classify” tab, press the “Choose” button, go to the “rules” folder, and select the “OneR” node. Now press the “Start” button; you should get a screen similar to the one below:
 

To interpret this information, below is a snippet of the “Classifier output” window:

=== Classifier model (full training set) ===
petallength:
                < 2.45   -> Iris-setosa
                < 4.85   -> Iris-versicolor
                >= 4.85 -> Iris-virginica
(143/150 instances correct)
=== Summary ===
1)     Correctly Classified Instances         140               93.3333 %
2)     Incorrectly Classified Instances        10                6.6667 %
3)     Kappa statistic                          0.9  
4)     Mean absolute error                      0.0444
5)     Root mean squared error                  0.2108
6)     Relative absolute error                 10      %
7)     Root relative squared error             44.7214 %
8)     Total Number of Instances              150    
=== Detailed Accuracy By Class ===
               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
                 1         0          1         1         1          1        Iris-setosa
                 0.88      0.04       0.917     0.88      0.898      0.92     Iris-versicolor
                 0.92      0.06       0.885     0.92      0.902      0.93     Iris-virginica
Weighted Avg.    0.933     0.033      0.934     0.933     0.933      0.95
=== Confusion Matrix ===
  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 44  6 |  b = Iris-versicolor
  0  4 46 |  c = Iris-virginica
Interpretation of OneR results
The “Classifier model (full training set)” section shows the rules that were discovered, in order to predict the classes (Iris-setosa, Iris-versicolor, Iris-virginica).
The rules that were discovered are:
petallength:
                < 2.45  -> Iris-setosa
                < 4.85  -> Iris-versicolor
                >= 4.85               -> Iris-virginica
We can interpret the above rules using a simple algorithm:
IF petallength < 2.45, RETURN Iris-setosa
ELSE IF petallength < 4.85, RETURN Iris-versicolor
ELSE RETURN Iris-virginica
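The same rule can be written as a small Java method. This is just a sketch of the discovered rule using the thresholds from the output above; the class and method names (IrisOneRRule, classify) are hypothetical and are not part of WEKA.

```java
public class IrisOneRRule {

    /** Applies the petallength rule discovered by OneR to a single measurement. */
    static String classify(double petalLength) {
        if (petalLength < 2.45) {
            return "Iris-setosa";
        } else if (petalLength < 4.85) {
            return "Iris-versicolor";
        } else {
            return "Iris-virginica";
        }
    }

    public static void main(String[] args) {
        // 1.4 is the petallength of the first instance in the data snippet.
        System.out.println(classify(1.4));
        System.out.println(classify(4.5));
        System.out.println(classify(5.5));
    }
}
```

Because the rule looks at a single attribute, the other three measurements are ignored entirely.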
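In the “Summary” section, the 1st line states that 93.33% of the instances (140 out of 150) were correctly classified, and the 2nd line states that 10 instances (6.67%) were incorrectly classified using the discovered rules. Note that the “(143/150 instances correct)” figure in the model section is measured on the training data itself, while the Summary figures come from the evaluation run, which is 10-fold cross-validation if the Explorer’s default test option was left unchanged; that is why the two counts differ.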
Confusion Matrix -
There is nothing confusing about a “confusion matrix”, which uses a tabular format to display performance results per class. Each class has its own row and column, where the row is the actual class and the column is the predicted class. For a 2-class prediction there is a 2x2 matrix (not counting row/column labels); see below:
 


   
 
                     Predicted Class
                     YES                    NO
Actual Class   YES   True Positive (TP)     False Negative (FN)
               NO    False Positive (FP)    True Negative (TN)

 
True positives and true negatives are correct classifications; false positives and false negatives are incorrect classifications.

The final success rate of your learning method is the number of correct classifications (TP + TN) divided by the total number of classifications, that is (TP + TN) / (TP + TN + FP + FN), and the error rate is 1 - success rate.

In our example, we have 3 classes (Iris-setosa, Iris-versicolor, Iris-virginica), which yield a 3x3 matrix (not counting row/column labels); see below:

=== Confusion Matrix ===
  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 44  6 |  b = Iris-versicolor
  0  4 46 |  c = Iris-virginica

I reformatted the original WEKA text output to appear more visually understandable:

 


 
                        a     b     c
a = Iris-setosa        50     0     0
b = Iris-versicolor     0    44     6
c = Iris-virginica      0     4    46


Now we can measure the success rate of the 1R classifier on the Iris data set: success rate = sum of the diagonal (correct classifications) / sum of all entries = (50+44+46) / (50+44+46+4+6) = 140/150 = 93.33% success rate, and a 6.67% error rate.
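The same arithmetic can be done for a confusion matrix of any size. The following is a minimal sketch; the class name ConfusionAccuracy is hypothetical, and the matrix is copied from the WEKA output above with rows as actual classes and columns as predicted classes.

```java
public class ConfusionAccuracy {

    /** Success rate: sum of the diagonal divided by the sum of all entries. */
    static double accuracy(int[][] matrix) {
        long correct = 0;
        long total = 0;
        for (int actual = 0; actual < matrix.length; actual++) {
            for (int predicted = 0; predicted < matrix[actual].length; predicted++) {
                total += matrix[actual][predicted];
                if (actual == predicted) {
                    correct += matrix[actual][predicted];
                }
            }
        }
        return (double) correct / total;
    }

    public static void main(String[] args) {
        int[][] iris = {
            {50, 0, 0},   // a = Iris-setosa
            {0, 44, 6},   // b = Iris-versicolor
            {0, 4, 46}    // c = Iris-virginica
        };
        double successRate = accuracy(iris);
        System.out.printf("success rate = %.4f, error rate = %.4f%n",
            successRate, 1.0 - successRate);
    }
}
```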


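Kappa Statistic -

The Kappa statistic measures the agreement between the predicted and the actual classes after correcting for the agreement that would be expected by chance. It is computed as (observed accuracy - chance accuracy) / (1 - chance accuracy), where the chance accuracy is what a random classifier with the same class proportions would achieve. A value of 1 means perfect agreement and 0 means no better than chance; in our example it is 0.9.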
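To make the Kappa value less of a black box, here is a minimal sketch that computes Cohen's kappa from the confusion matrix shown earlier. The class name KappaSketch is hypothetical, and the chance agreement is assumed to be estimated from the row and column totals of the matrix, which is the usual textbook definition.

```java
public class KappaSketch {

    /** Cohen's kappa: (observed - expected) / (1 - expected) agreement. */
    static double kappa(int[][] matrix) {
        int k = matrix.length;
        double total = 0;
        double diagonal = 0;
        double[] rowSum = new double[k]; // actual class totals
        double[] colSum = new double[k]; // predicted class totals
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < k; j++) {
                total += matrix[i][j];
                rowSum[i] += matrix[i][j];
                colSum[j] += matrix[i][j];
                if (i == j) {
                    diagonal += matrix[i][j];
                }
            }
        }
        double observed = diagonal / total;
        // Agreement expected if predictions were drawn at random
        // with the same class proportions.
        double expected = 0;
        for (int i = 0; i < k; i++) {
            expected += (rowSum[i] / total) * (colSum[i] / total);
        }
        return (observed - expected) / (1 - expected);
    }

    public static void main(String[] args) {
        int[][] iris = {{50, 0, 0}, {0, 44, 6}, {0, 4, 46}};
        System.out.printf("kappa = %.4f%n", kappa(iris));
    }
}
```

For the Iris matrix the observed agreement is 140/150 and the chance agreement works out to 1/3, which gives the 0.9 reported in the Summary section.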

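Mean absolute error  -

Is the average measure of how close predictions are to actual values: MAE = (1/n) * Σ |F(x_i) - y_i|, where F(x_i) is the predicted value, y_i is the actual value, and n is the total number of data points.

Reference (http://en.wikipedia.org/wiki/Mean_absolute_error).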

Root mean squared error  -

Is the measure of how well a linear formula (y = mx + b) or a non-linear formula (polynomial curve or exponential curve) fits the actual data if you were to put them side by side on a regular graph: RMSE = sqrt( (1/n) * Σ (F(x_i) - y_i)^2 ). F(x_i) is the value calculated from your formula, y_i is the actual value, and n is the total number of data points.
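For a classifier, the "values" being compared are class memberships rather than numbers on a curve. The sketch below assumes (this is an assumption, not something stated in the WEKA output) that each prediction and each actual class is treated as a 0/1 vector over the three classes and that the errors are averaged over all instances and classes. Under that assumption it reproduces the two error figures from the Summary section. The class name ErrorSketch is hypothetical.

```java
public class ErrorSketch {

    /** Returns {MAE, RMSE} computed over 0/1 class-membership vectors. */
    static double[] maeAndRmse(int[][] matrix) {
        int k = matrix.length;
        double n = 0;
        double absSum = 0;
        double sqSum = 0;
        for (int actual = 0; actual < k; actual++) {
            for (int predicted = 0; predicted < k; predicted++) {
                int count = matrix[actual][predicted];
                n += count;
                for (int c = 0; c < k; c++) {
                    double a = (c == actual) ? 1.0 : 0.0;    // true membership
                    double p = (c == predicted) ? 1.0 : 0.0; // predicted membership
                    absSum += count * Math.abs(p - a);
                    sqSum += count * (p - a) * (p - a);
                }
            }
        }
        double mae = absSum / (n * k);
        double rmse = Math.sqrt(sqSum / (n * k));
        return new double[] {mae, rmse};
    }

    public static void main(String[] args) {
        int[][] iris = {{50, 0, 0}, {0, 44, 6}, {0, 4, 46}};
        double[] errors = maeAndRmse(iris);
        System.out.printf("MAE = %.4f, RMSE = %.4f%n", errors[0], errors[1]);
    }
}
```

Each of the 10 misclassified instances contributes an error of 1 for two classes, so the totals are 20 over 150 x 3 = 450 terms, which rounds to 0.0444 and 0.2108.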

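Relative absolute error  -

Given some value v and its approximation vapprox, the absolute error is |v - vapprox|. The relative absolute error reported by WEKA is the total absolute error of the classifier divided by the total absolute error of a simple baseline that always predicts the average (for classification, the class proportions), expressed as a percentage.

Root relative squared error   -

Given the same v and vapprox, the relative error is |v - vapprox| / |v|. The root relative squared error reported by WEKA is the square root of the classifier's total squared error divided by the total squared error of the same simple baseline, expressed as a percentage.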
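As a worked illustration of the two relative measures, the sketch below compares the classifier's errors with those of a baseline that always predicts the class proportions (1/3 for each Iris class). This rests on the same assumptions as before: 0/1 class-membership vectors, and a baseline taken from the class proportions in the confusion matrix. The class name RelativeErrorSketch is hypothetical, and WEKA's own computation may differ in details.

```java
public class RelativeErrorSketch {

    /** Returns {relative absolute error %, root relative squared error %}. */
    static double[] relativeErrors(int[][] matrix) {
        int k = matrix.length;
        double n = 0;
        double[] actualCount = new double[k];
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < k; j++) {
                n += matrix[i][j];
                actualCount[i] += matrix[i][j];
            }
        }
        double absErr = 0, sqErr = 0;   // classifier errors
        double baseAbs = 0, baseSq = 0; // baseline errors
        for (int actual = 0; actual < k; actual++) {
            for (int predicted = 0; predicted < k; predicted++) {
                int count = matrix[actual][predicted];
                for (int c = 0; c < k; c++) {
                    double a = (c == actual) ? 1.0 : 0.0;
                    double p = (c == predicted) ? 1.0 : 0.0;
                    double prior = actualCount[c] / n; // baseline prediction
                    absErr += count * Math.abs(p - a);
                    sqErr += count * (p - a) * (p - a);
                    baseAbs += count * Math.abs(prior - a);
                    baseSq += count * (prior - a) * (prior - a);
                }
            }
        }
        return new double[] {
            100.0 * absErr / baseAbs,
            100.0 * Math.sqrt(sqErr / baseSq)
        };
    }

    public static void main(String[] args) {
        int[][] iris = {{50, 0, 0}, {0, 44, 6}, {0, 4, 46}};
        double[] r = relativeErrors(iris);
        System.out.printf("RAE = %.4f %%, RRSE = %.4f %%%n", r[0], r[1]);
    }
}
```

With these assumptions the totals are 20 against 200 for the absolute errors and 20 against 100 for the squared errors, matching the 10 % and 44.7214 % shown in the Summary section.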
 
 

 



References

1)     Witten, I., Frank, E., & Hall, M. (2011). Data mining: Practical machine learning tools and techniques (3rd ed., Vol. 1, pp. 163-167). Burlington, MA: Morgan Kaufmann.

2)     Nevill-Manning, C., Holmes, G., & Witten, I. (1995). The development of Holte’s 1R classifier. Department of Computer Science. Retrieved from http://www.cs.waikato.ac.nz/ml/publications/1995/Nevill-Manning95-1R.pdf







 

 
