Machine Learning Weka

Task

Investigate algorithms for generating rules and decision trees using the Weka software, and formulate general conclusions.

Goals

  • Generate or collect training data
  • Use 5-10 attributes and 10-100000 data records
  • Test 6 algorithms

Introduction

Weka is data mining software that supports a large number of machine learning algorithms. The downside is that it can be overwhelming to know which algorithms to use, and when. However, trying them out helps you get used to their non-standard names, learn more about the problem, and narrow the field down to the few algorithms that perform best in this context.

Dataset

The data set I’m using is Student Alcohol Consumption from kaggle.com. The data were obtained in a survey of students in secondary-school math courses and contain social, gender and study information about the students: https://www.kaggle.com/uciml/student-alcohol-consumption/version/1#student-mat.csv

To make the data readable by Weka, the dataset is converted to ARFF, Weka's native format, which lists the attributes (the columns in the data) and their types, followed by the data rows. More about the ARFF format: https://www.cs.waikato.ac.nz/ml/weka/arff.html
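The structure of an ARFF file can be sketched with a small converter. This is a minimal illustration, not the conversion used for this post: the function name `to_arff`, the relation name, and the toy attributes and rows are all hypothetical, and values containing spaces or commas would need quoting that is not handled here.

```python
def to_arff(relation, attributes, rows):
    """Render a dataset as ARFF text.

    attributes: list of (name, type) pairs, where type is either the string
    "numeric" or a list of the allowed nominal values.
    rows: list of value lists, one value per attribute.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            # Numeric (or other keyword) type, e.g. "@attribute age numeric"
            lines.append(f"@attribute {name} {kind}")
        else:
            # Nominal type: allowed values listed inside braces
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical toy attributes and rows, not the real survey columns.
attributes = [
    ("age", "numeric"),
    ("sex", ["F", "M"]),
    ("alcohol", ["very_low", "low", "normal", "high", "very_high"]),
]
rows = [[18, "F", "very_low"], [17, "M", "low"]]
arff_text = to_arff("students", attributes, rows)
print(arff_text)
```

The header declares each attribute once, and every line after `@data` is one record with its values in the same order as the declarations.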

Weka Machine Learning Algorithms

The algorithms are found under the Classify tab of the Explorer and are divided into main groups: bayes, functions, lazy, meta, etc. Weka provides more information about each machine learning algorithm under the “More” button in the algorithm's configuration dialog.

Machine Learning Algorithm Configuration

To get the best results from an algorithm, it sometimes needs to be configured manually to suit the data. However, this requires systematically testing a suite of standard configurations.
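One way to test configurations systematically is to enumerate them up front. The sketch below only builds Weka command lines for lazy.IBk and does not run them. The jar location, the ARFF file name, the candidate values and the use of 10-fold cross-validation are assumptions for illustration, not settings taken from this experiment.

```python
from itertools import product

# Assumed values: jar path, file name and candidate settings are illustrative.
WEKA_JAR = "weka.jar"
TRAIN_FILE = "student-mat.arff"
K_VALUES = [1, 3, 5, 7]        # IBk -K: number of nearest neighbours
WEIGHTING = ["", "-I", "-F"]   # no weighting, 1/distance, 1-distance


def ibk_commands(folds=10):
    """Build one cross-validation command line per IBk configuration."""
    commands = []
    for k, weight in product(K_VALUES, WEIGHTING):
        parts = [
            "java", "-cp", WEKA_JAR, "weka.classifiers.lazy.IBk",
            "-t", TRAIN_FILE,    # training file
            "-x", str(folds),    # number of cross-validation folds
            "-K", str(k),
        ]
        if weight:
            parts.append(weight)
        commands.append(" ".join(parts))
    return commands


commands = ibk_commands()
for command in commands:
    print(command)
```

Listing every combination first makes it harder to skip a setting by accident and keeps the runs comparable, since only the classifier options change between them.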

Which Algorithm To Use

To get an idea of which algorithm to use, I am going to test the following 6 algorithms:

  • functions.Logistic https://weka.sourceforge.io/doc.dev/weka/classifiers/functions/Logistic.html
  • bayes.NaiveBayes https://weka.sourceforge.io/doc.dev/weka/classifiers/bayes/NaiveBayes.html
  • lazy.IBk https://weka.sourceforge.io/doc.dev/weka/classifiers/lazy/IBk.html
  • functions.SMO https://weka.sourceforge.io/doc.dev/weka/classifiers/functions/SMO.html
  • trees.REPTree https://weka.sourceforge.io/doc.dev/weka/classifiers/trees/REPTree.html
  • trees.RandomForest https://weka.sourceforge.io/doc.dev/weka/classifiers/trees/RandomForest.html

Logistic


=== Summary ===

Correctly Classified Instances         250               63.2911 %
Incorrectly Classified Instances       145               36.7089 %
Kappa statistic                          0.2622
Mean absolute error                      0.151 
Root mean squared error                  0.3458
Relative absolute error                 79.4869 %
Root relative squared error            112.7207 %
Total Number of Instances              395     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.801    0.361    0.837      0.801    0.819      0.428    0.823     0.916     very_low
                 0.267    0.119    0.345      0.267    0.301      0.164    0.645     0.315     low
                 0.192    0.070    0.161      0.192    0.175      0.112    0.650     0.119     normal
                 0.111    0.067    0.037      0.111    0.056      0.026    0.515     0.029     high
                 0.333    0.031    0.200      0.333    0.250      0.236    0.741     0.152     very_high
Weighted Avg.    0.633    0.282    0.666      0.633    0.648      0.344    0.769     0.712     

=== Confusion Matrix ===

   a   b   c   d   e   <-- classified as
 221  28  12   9   6 |   a = very_low
  34  20   9  11   1 |   b = low
   7   8   5   3   3 |   c = normal
   1   2   3   1   2 |   d = high
   1   0   2   3   3 |   e = very_high

Naive Bayes



=== Summary ===

Correctly Classified Instances         273               69.1139 %
Incorrectly Classified Instances       122               30.8861 %
Kappa statistic                          0.3038
Mean absolute error                      0.139 
Root mean squared error                  0.2945
Relative absolute error                 73.1455 %
Root relative squared error             95.9912 %
Total Number of Instances              395     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.877    0.454    0.818      0.877    0.846      0.448    0.851     0.938     very_low
                 0.253    0.119    0.333      0.253    0.288      0.150    0.741     0.374     low
                 0.346    0.049    0.333      0.346    0.340      0.292    0.823     0.205     normal
                 0.000    0.010    0.000      0.000    0.000      -0.015   0.676     0.053     high
                 0.333    0.021    0.273      0.333    0.300      0.283    0.822     0.173     very_high
Weighted Avg.    0.691    0.344    0.663      0.691    0.675      0.367    0.824     0.745     

=== Confusion Matrix ===

   a   b   c   d   e   <-- classified as
 242  27   3   1   3 |   a = very_low
  44  19   9   1   2 |   b = low
   5   8   9   2   2 |   c = normal
   2   2   4   0   1 |   d = high
   3   1   2   0   3 |   e = very_high


IBk



=== Summary ===

Correctly Classified Instances         247               62.5316 %
Incorrectly Classified Instances       148               37.4684 %
Kappa statistic                          0.152 
Mean absolute error                      0.154 
Root mean squared error                  0.3856
Relative absolute error                 81.0449 %
Root relative squared error            125.7104 %
Total Number of Instances              395     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.808    0.613    0.753      0.808    0.780      0.206    0.591     0.747     very_low
                 0.213    0.144    0.258      0.213    0.234      0.075    0.509     0.200     low
                 0.269    0.046    0.292      0.269    0.280      0.232    0.610     0.125     normal
                 0.000    0.016    0.000      0.000    0.000      -0.019   0.541     0.025     high
                 0.111    0.016    0.143      0.111    0.125      0.108    0.565     0.038     very_high
Weighted Avg.    0.625    0.460    0.598      0.625    0.610      0.175    0.575     0.570     

=== Confusion Matrix ===

   a   b   c   d   e   <-- classified as
 223  36  11   5   1 |   a = very_low
  50  16   5   0   4 |   b = low
  13   5   7   1   0 |   c = normal
   5   2   1   0   1 |   d = high
   5   3   0   0   1 |   e = very_high


SMO

=== Summary ===

Correctly Classified Instances         273               69.1139 %
Incorrectly Classified Instances       122               30.8861 %
Kappa statistic                          0.2821
Mean absolute error                      0.2607
Root mean squared error                  0.3462
Relative absolute error                137.1955 %
Root relative squared error            112.8557 %
Total Number of Instances              395     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.877    0.513    0.799      0.877    0.836      0.395    0.724     0.811     very_low
                 0.333    0.122    0.391      0.333    0.360      0.225    0.602     0.256     low
                 0.077    0.030    0.154      0.077    0.103      0.065    0.590     0.096     normal
                 0.111    0.010    0.200      0.111    0.143      0.134    0.293     0.021     high
                 0.333    0.018    0.300      0.333    0.316      0.299    0.948     0.245     very_high
Weighted Avg.    0.691    0.384    0.654      0.691    0.670      0.333    0.687     0.628     

=== Confusion Matrix ===

   a   b   c   d   e   <-- classified as
 242  30   4   0   0 |   a = very_low
  42  25   4   0   4 |   b = low
  13   7   2   2   2 |   c = normal
   4   2   1   1   1 |   d = high
   2   0   2   2   3 |   e = very_high


REPTree

=== Summary ===

Correctly Classified Instances         273               69.1139 %
Incorrectly Classified Instances       122               30.8861 %
Kappa statistic                          0.2624
Mean absolute error                      0.1437
Root mean squared error                  0.2836
Relative absolute error                 75.6314 %
Root relative squared error             92.4626 %
Total Number of Instances              395     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.899    0.538    0.795      0.899    0.844      0.406    0.824     0.898     very_low
                 0.253    0.119    0.333      0.253    0.288      0.150    0.694     0.291     low
                 0.000    0.011    0.000      0.000    0.000      -0.027   0.718     0.139     normal
                 0.000    0.003    0.000      0.000    0.000      -0.008   0.715     0.108     high
                 0.667    0.039    0.286      0.667    0.400      0.418    0.973     0.300     very_high
Weighted Avg.    0.691    0.400    0.625      0.691    0.653      0.320    0.793     0.701     

=== Confusion Matrix ===

   a   b   c   d   e   <-- classified as
 248  24   1   1   2 |   a = very_low
  51  19   1   0   4 |   b = low
  10  10   0   0   6 |   c = normal
   3   3   0   0   3 |   d = high
   0   1   2   0   6 |   e = very_high

RandomForest

=== Summary ===

Correctly Classified Instances         275               69.6203 %
Incorrectly Classified Instances       120               30.3797 %
Kappa statistic                          0.1445
Mean absolute error                      0.1553
Root mean squared error                  0.2764
Relative absolute error                 81.7587 %
Root relative squared error             90.1143 %
Total Number of Instances              395     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.949    0.790    0.736      0.949    0.829      0.245    0.840     0.930     very_low
                 0.147    0.059    0.367      0.147    0.210      0.129    0.759     0.393     low
                 0.000    0.005    0.000      0.000    0.000      -0.019   0.815     0.167     normal
                 0.000    0.003    0.000      0.000    0.000      -0.008   0.557     0.204     high
                 0.222    0.010    0.333      0.222    0.267      0.258    0.968     0.402     very_high
Weighted Avg.    0.696    0.564    0.591      0.696    0.625      0.200    0.819     0.749     

=== Confusion Matrix ===

   a   b   c   d   e   <-- classified as
 262  14   0   0   0 |   a = very_low
  60  11   1   0   3 |   b = low
  23   2   0   1   0 |   c = normal
   5   2   1   0   1 |   d = high
   6   1   0   0   2 |   e = very_high
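
The six runs can be compared side by side. The following is a minimal sketch with the accuracy and kappa values copied from the Weka summaries above; the dictionary layout and the choice to sort by kappa are mine, not part of Weka's output.

```python
# (accuracy in %, kappa statistic) copied from the Weka summaries above.
results = {
    "Logistic": (63.2911, 0.2622),
    "NaiveBayes": (69.1139, 0.3038),
    "IBk": (62.5316, 0.1520),
    "SMO": (69.1139, 0.2821),
    "REPTree": (69.1139, 0.2624),
    "RandomForest": (69.6203, 0.1445),
}

# Rank the algorithms by each measure, best first.
by_accuracy = sorted(results, key=lambda name: results[name][0], reverse=True)
by_kappa = sorted(results, key=lambda name: results[name][1], reverse=True)

print(f"{'Algorithm':<14}{'Accuracy %':>12}{'Kappa':>8}")
for name in by_kappa:
    accuracy, kappa = results[name]
    print(f"{name:<14}{accuracy:>12.4f}{kappa:>8.4f}")
```

With these numbers RandomForest ranks first by accuracy but last by kappa, while NaiveBayes ranks first by kappa. Since 276 of the 395 instances are very_low, a classifier that mostly predicts that class already scores about 70%, so accuracy alone is a weak guide here.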

Summary

There are three things to note in the performance summary for classification algorithms:

  • Classification accuracy:
    This is the ratio of the number of correct predictions to all predictions made, often presented as a percentage, where 100% is the best an algorithm can achieve.
  • Accuracy by class:
    The true-positive and false-positive rates of the predictions for each class.
  • Confusion matrix:
    A table showing the number of predictions for each class compared to the number of instances that actually belong to each class.
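
The first two items can be derived from the third. As a sketch, the code below recomputes accuracy, the kappa statistic (Cohen's kappa) and the per-class rates from the RandomForest confusion matrix shown earlier. The variable and function names are illustrative; Weka computes these values internally.

```python
# RandomForest confusion matrix from the results above:
# rows are actual classes, columns are predicted classes.
labels = ["very_low", "low", "normal", "high", "very_high"]
matrix = [
    [262, 14, 0, 0, 0],
    [60, 11, 1, 0, 3],
    [23, 2, 0, 1, 0],
    [5, 2, 1, 0, 1],
    [6, 1, 0, 0, 2],
]

total = sum(sum(row) for row in matrix)
row_totals = [sum(row) for row in matrix]        # actual instances per class
col_totals = [sum(col) for col in zip(*matrix)]  # predictions per class
correct = sum(matrix[i][i] for i in range(len(labels)))

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = correct / total

# Kappa: agreement beyond what the class totals would produce by chance.
expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
kappa = (accuracy - expected) / (1 - expected)


def rates(i):
    """Return (TP rate, FP rate) for class i, treating it as the positive class."""
    tp = matrix[i][i]
    fn = row_totals[i] - tp
    fp = col_totals[i] - tp
    tn = total - tp - fn - fp
    return tp / (tp + fn), fp / (fp + tn)


print(f"accuracy = {accuracy:.4%}, kappa = {kappa:.4f}")
for i, label in enumerate(labels):
    tp_rate, fp_rate = rates(i)
    print(f"{label:<10} TP rate {tp_rate:.3f}  FP rate {fp_rate:.3f}")
```

The results agree with the Weka output for this run: 275 of 395 correct (69.6203 %), a kappa of 0.1445, and a TP rate of 0.949 with an FP rate of 0.790 for very_low.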
