Computer Science: Algorithms & Data Science: Introduction to Data Mining & Machine Learning using WEKA Explorer and 1R Classifier

Data Mining & Machine Learning using WEKA Explorer

The 1R Classifier

The 1R (Holte’s 1R) classifier is a simple machine learning algorithm that works surprisingly well on standard data sets. Robert Holte developed 1R learning algorithm that competes very well with cutting edge algorithms. Robert Holte did not advocate the use of 1R algorithm versus other mainstream algorithms rather used the simple one classifier to demonstrate that many real word datasets did not embody very complex relations, therefore, do not warrant complicated learning algorithms. In order to, use the 1R classifier, one must prepare the data to deal with such issues as, contentious valued data, missing values, and greedy decisions, these topics are beyond the scope of this article.

Below are the general steps of 1R algorithm:

For each attribute a, form a rule as follows:

For each value v from the domain of a,

Select the set of instances where a has value v.

Let c be the most frequent class in that set.

Add the following clause to the rule for a:

if a has value v then the class is c

Finally, calculate the classification accuracy of this rule.

Note: Use the rule with the highest classification accuracy. The algorithm assumes that the attributes are

Discrete.

(Manning , Holmes & Ian, 1995).

Lets go over the required software & setup.

Installation & Setup

Software that you will need:

1) WEKA , required

2) JDK & JRE, required

3) Any Java IDE (Eclipse , Netbeans, etc...) , recommended

In order to download WEKA please go to the following URL (http://www.cs.waikato.ac.nz/ml/weka/index.html)

Setting up your environment

1) Please download the latest JDK (Java Development Kit) & JRE (Java Runtime) , as WEKA is written in Java.

JDK (http://www.oracle.com/technetwork/java/javase/downloads/index.html)

2) Make sure that your PATH & CLASSPATH system variables point to JRE & JDK’s bin directory.

Setting up your environment in Windows via GUI.

Go to Start , type “cmd” in the “Search programs and files” box Windows Command Prompt will appear

Checking Java

a) Check value of CLASSPATH by typing this command “echo %CLASSPATH%”

You should see all paths that are in CLASSPATH and should see your JDK’s bin folder

b) In order to check whether the JDK is setup properly and whether JDK is available for compile your JAVA source globally , please type this command “javac”

You should see all “javac” command options printed out

Open WEKA

1) Let’s check, if WEKA has been installed & running. Go Start, WEKA folder, such as (Weka 3.6.9) open the folder, click on WEKA icon. You should see a small GUI application open, click on Explorer button.

The Explorer should open

1) A brief explanation of the the WEKA Explorer panels.

· Preprocess tab, is where the data can be loaded. The panel will let one view,edit, and save the uploaded data.

· Classify tab, I where the classification algorithms can be accessed.

· Visualize tab, data can be viewed in graphical plots. Importing our example data set into WEKA.

2) The Iris data set is probably one of the most cited data sets in data mining, machine learning, and statistics studies, I obtained a copy from http://repository.seasr.org/Datasets/UCI/arff/iris.arff.

Data Set (Snippet of the Iris Data set)

@RELATION iris

@ATTRIBUTE sepallength REAL

@ATTRIBUTE sepalwidth REAL

@ATTRIBUTE petallength REAL

@ATTRIBUTE petalwidth REAL

@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA

5.1,3.5,1.4,0.2,Iris-setosa

4.9,3.0,1.4,0.2,Iris-setosa

4.7,3.2,1.3,0.2,Iris-setosa

4.6,3.1,1.5,0.2,Iris-setosa

Convert CSV data to “.arff” file (Optional Section)

Now most data in the world isn’t formatted for any format & the data miners or programmers have to prepare the data before using any tool. The format WEKA uses is “arff” , which is not a common format outside of the data mining world. Fortunately, WEKA provides Java code that will convert CSV (comma separated values) to “arff” format file

If your data isn’t in “arff” format follow this section & mini example, else skip it , and go Opening Opening The Data Set.

1) Since the data set is small , I’m going to paste it into Microsoft excel first.

(If the data set were very large one would gave write a program to convert the data’s format to CSV )

2) I manually pasted the data set below in Excel and , than go to File, Save As , Save as type option list, select CSV (Comma delimited), click Save, & click ok for all other warnings.

3) Also do “Find & Replace” all “|” to commas “,” than Save again.

As you start you becoming a data miner, or data analyst, or any field that deals with data you will realize that formatting the data , and ensuring data integrity is probably as hard as data mining and will consume a great deal of any “real world” project.

4) Fortunately, the WEKA people were wise enough to provide us with a way to convert CSV to “arff” format files. However a user unfriendly way.

Go to your Java IDE & create a simple Java project & paste the code below:

Reference : http://weka.wikispaces.com/Converting+CSV+to+ARFF

Program usage :

command line argument 0 “args[0]” is the path of CSV file you want to convert

Example : C:/Users/anaim/Desktop/Data Mining/OneR/Golf.csv

command line argument 1 “args[1]” is the path & name that you will like to save the .arff file

Example : C:/Users/anaim/Desktop/Data Mining/OneR/Golf.arff

Remember when passing paths in Java one most change backslashes “\” to forward “/” slashes.

import weka.core.Instances;

import weka.core.converters.ArffSaver;

import weka.core.converters.CSVLoader;

import java.io.File;

public class CSV2Arff {

/**

* takes 2 arguments:

* - CSV input file

* - ARFF output file

public static void main(String[] args) throws Exception {

if (args.length != 2) {

System.out.println("\nUsage: CSV2Arff <input.csv> <output.arff>\n");

System.exit(1);

}

// load CSV

CSVLoader loader = new CSVLoader();

loader.setSource(new File(args[0]));

Instances data = loader.getDataSet();

// save ARFF

ArffSaver saver = new ArffSaver();

saver.setInstances(data);

//I switched the order of the setDestination() call with setFile() as the original code had a bug

saver.setDestination(new File(args[1]));

saver.setFile(new File(args[1]));

saver.writeBatch();

}}

Make sure to add the “weka” jar to your projects build path , in order to be able to access WEKA CSV to arff capabilities used in the Java code above, see below for snapshot of this process:

Now if your using Eclipse right click on your project , click run configurations, select the “Arguments” tab, see the snapshot below:

Now run the program & generate the .arff file.

Opening Opening The Data Set

Finally , Go to the WEKA Explorer and select “Open file…” option & navigate to the path of your data set (Example Golf.arff).

You should see a screen similar to the one below , and also your attributes “sepallength, sepalwidth,humidity, petallength, petalwidth, and class” listed on the left.

Go to the “Classify” tab , and press the “Choose” button go to the “rules” folder and select “OneR” node. Now press “Start” button, you should get a screen similar to the one below:

Interpret this information, well below is snippet of the “Classifier output” window:

=== Classifier model (full training set) ===

petallength:

< 2.45 -> Iris-setosa

< 4.85 -> Iris-versicolor

>= 4.85 -> Iris-virginica

(143/150 instances correct)

=== Summary ===

1) Correctly Classified Instances 140 93.3333 %

2) Incorrectly Classified Instances 10 6.6667 %

3) Kappa statistic 0.9

4) Mean absolute error 0.0444

5) Root mean squared error 0.2108

6) Relative absolute error 10 %

7) Root relative squared error 44.7214 %

8) Total Number of Instances 150

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

1 0 1 1 1 1 Iris-setosa

0.88 0.04 0.917 0.88 0.898 0.92 Iris-versicolor

0.92 0.06 0.885 0.92 0.902 0.93 Iris-virginica

Weighted Avg. 0.933 0.033 0.934 0.933 0.933 0.95

=== Confusion Matrix ===

a b c <-- classified as

50 0 0 | a = Iris-setosa

0 44 6 | b = Iris-versicolor

0 4 46 | c = Iris-virginica

Interpretation of OneR results

The “Classifier model (full training set)” section shows the rules that were discovered, in order to predict the classes (Iris-setosa, Iris-versicolor, Iris-virginica).

The rules that were discovered are:

petallength:

< 2.45 -> Iris-setosa

< 4.85 -> Iris-versicolor

>= 4.85 -> Iris-virginica

We can interpret the above rules using a simple algorithm:

IF petallength <2.45 , RETURN Iris-setosa

Else if petallength <4.85, RETURN Iris-versicolor

ELSE RETURN Iris-virginica

In the “Summary” section the 1^st line states that 93.3% of the instances were correctly classified that is 140 out of 150 total instances, and the 2^nd line states the number of incorrectly classified instances which is 6.67% using the discovered rules.

Confusion Matrix -

There is nothing confusing about a “confusion matrix” , which uses tabular format to display performance results per class. Each class has its row and column , were actual class is the number in the row and the predicted class is the number in the column. For 2 Class prediction there is 2x2 matrix (not counting row/column labels) , see below:

		Predicted Class
		YES	NO
Actual Class	YES	True Positive (TP)	False Negative (FN)
Actual Class	NO	False Positive (FP)	True Negative (TN)

True positives and true negatives are correct classifications, false positives and false negatives are incorrect classifications.

The final success rate of your learning method is the number of correct classifications (TP + TN) divided by total number of all classifications ( TP +TN / TP+TN + FP+ FN) and the error rate is 1 – success rate.

In our example , we have 3 classes (Iris-setosa, Iris-versicolor, Iris-virginica) whoich will yield a 3x3 matrix (not counting row/column labels), see below:

=== Confusion Matrix ===

a b c <-- classified as

50 0 0 | a = Iris-setosa

0 44 6 | b = Iris-versicolor

0 4 46 | c = Iris-virginica

I reformatted the original WEKA text output to appear more visually understandable:

	a	b	c
a = Iris-setosa	50	0	0
b = Iris-versicolor	0	44	6
c = Iris-virginica	0	4	46

Now we can measure the success rate of the 1R classifier on the Iris data set , success rate = all numbers in the diagonal (true classifications / entire sum) = 50+44+46 / 50+44+46+4+6 = 150/160 = 93.75% success rate , and 6.75% error rate.

Kappa Statistic –

The Kappa Statistic is the percentage of possible predictions correctly classified by chance. The Kappa statistic is derived by using a random classifier.

Mean absolute error -

Is the average measure of how close predictions are to actual values.

Reference (http://en.wikipedia.org/wiki/Mean_absolute_error).

Root mean squared error -

Is the measure of how well linear formula (y= mx+b) or non-linear formula (polynomial curve or exponential curve) fits the actual data if you were to put them side by side on a regular graph. F(x) i the calculated values from your formula & “y” is the actual value, and “n” is the total number of data points.

Relative absolute error -

Given some value v and its approximation v_approx, the absolute error is

(Reference , http://en.wikipedia.org/wiki/Approximation_error).

Root relative squared error -

the relative error is

(Reference , http://en.wikipedia.org/wiki/Approximation_error).

References

1) Witten, I., Frank, E., & Hall, M. (2011). Data mining , practical machine learning tools and techniques. (3rd ed., Vol. 1, pp. 163-167). Burlington , MA: Morgan Kauffman.

2) Manning , N., Holmes, G., & Ian, H. (1995). The development of holte’s 1r classifier. Department of Computer Science,, Retrieved from http://www.cs.waikato.ac.nz/ml/publications/1995/Nevill-Manning95-1R.pdf

3) http://en.wikipedia.org/wiki/Cohen's_kappa

4) http://www.cis.temple.edu/~giorgio/cis587/readings/id3-c45.html

5) http://repository.seasr.org/Datasets/UCI/arff/

6) http://archive.ics.uci.edu/ml/

7) http://www.cs.waikato.ac.nz/ml/weka/

8) http://en.wikipedia.org/wiki/Mean_absolute_error

9) (Reference , http://en.wikipedia.org/wiki/Approximation_error).