# Machine Learning Weka

Investigate algorithms for generating different rules and decision trees using the Weka software, and formulate generalizing conclusions.

## Goals

• Generate or collect training data
• The number of attributes should be `5-10`, and the number of data records `10-100000`
• Test `6` algorithms

# Introduction

Weka is data mining software that supports a large number of machine learning algorithms. The downside is that it can be overwhelming to know which algorithms to use, and when. However, testing several of them helps you get used to their non-standard names, learn more about the problem, and get closer to discovering the few algorithms that perform best in your context.

# Dataset

The data set I’m using is Student Alcohol Consumption from kaggle.com. The data were obtained in a survey of students in secondary-school math courses and contain social, gender, and study information about the students: `https://www.kaggle.com/uciml/student-alcohol-consumption/version/1#student-mat.csv`

To make the data readable by Weka, the dataset is converted to ARFF, Weka's native format, which lists the attributes (the columns in the data) and their types, followed by the data rows. More about the ARFF format: `https://www.cs.waikato.ac.nz/ml/weka/arff.html`
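As a rough illustration of what that conversion involves, the sketch below turns a small piece of CSV text into ARFF. It is not the exact procedure used for this post: the helper name `csv_to_arff`, the relation name, and the three sample rows are made up for the example, and it assumes every non-numeric column can be treated as a nominal attribute with no missing or quoted values.

```python
import csv
import io


def csv_to_arff(csv_text, relation, numeric_columns):
    """Convert CSV text to ARFF text (hypothetical helper, simplified).

    Columns named in numeric_columns are declared `numeric`; every other
    column is declared nominal, using the distinct values found in the data.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    names = list(rows[0].keys())

    lines = ["@relation " + relation, ""]
    for name in names:
        if name in numeric_columns:
            lines.append("@attribute " + name + " numeric")
        else:
            values = sorted({row[name] for row in rows})
            lines.append("@attribute " + name + " {" + ",".join(values) + "}")

    # The data section repeats the rows in the declared attribute order.
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(row[name] for name in names))
    return "\n".join(lines) + "\n"


# Made-up sample rows, not taken from the real survey file.
sample = "sex,age,Dalc\nF,18,very_low\nM,17,low\nF,16,very_low\n"
arff = csv_to_arff(sample, "student_alcohol", {"age"})
print(arff)
```

The header declares one `@attribute` line per column, and the `@data` section carries the rows unchanged, which is the structure the ARFF page linked above describes.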

# Weka Machine Learning Algorithms

The algorithms are found under the Classify tab of the Explorer and are divided into main groups: bayes, functions, lazy, meta, and so on. Weka provides more information about each machine learning algorithm under the “Learn more” button.

# Machine Learning Algorithm Configuration

To get the best results, an algorithm sometimes needs to be configured manually so that it behaves well on the data. Doing this properly requires systematically testing a suite of standard configurations.
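One way to picture such a suite is as a small grid of option strings. The sketch below enumerates a grid for `lazy.IBk`, one of the classifiers tested later, using its `-K` (number of neighbours), `-I` (inverse-distance weighting) and `-F` (1 - distance weighting) options. The particular values are assumptions chosen for illustration; they are not the settings used for the results in this post, which were produced in the Explorer.

```python
from itertools import product

# Illustrative values only: the post does not say which settings were tried.
NEIGHBOURS = [1, 3, 5]          # IBk option -K: number of nearest neighbours
WEIGHTINGS = ["", "-I", "-F"]   # no weighting, inverse distance, 1 - distance


def ibk_configurations():
    """Return one Weka scheme string per combination in the grid."""
    configs = []
    for k, weighting in product(NEIGHBOURS, WEIGHTINGS):
        options = ("-K %d %s" % (k, weighting)).strip()
        configs.append("weka.classifiers.lazy.IBk " + options)
    return configs


for scheme in ibk_configurations():
    print(scheme)
```

Each printed line is one configuration to evaluate under the same test options, so the results stay comparable.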

# Which Algorithm To Use

To get an idea of which algorithm to use, I am going to test the top 6 algorithms:

• functions.Logistic `https://weka.sourceforge.io/doc.dev/weka/classifiers/functions/Logistic.html`
• bayes.NaiveBayes `https://weka.sourceforge.io/doc.dev/weka/classifiers/bayes/NaiveBayes.html`
• lazy.IBk `https://weka.sourceforge.io/doc.dev/weka/classifiers/lazy/IBk.html`
• functions.SMO `https://weka.sourceforge.io/doc.dev/weka/classifiers/functions/SMO.html`
• trees.REPTree `https://weka.sourceforge.io/doc.dev/weka/classifiers/trees/REPTree.html`
• trees.RandomForest `https://weka.sourceforge.io/doc.dev/weka/classifiers/trees/RandomForest.html`

## Logistic

``````
=== Summary ===

Correctly Classified Instances         250               63.2911 %
Incorrectly Classified Instances       145               36.7089 %
Kappa statistic                          0.2622
Mean absolute error                      0.151
Root mean squared error                  0.3458
Relative absolute error                 79.4869 %
Root relative squared error            112.7207 %
Total Number of Instances              395

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.801    0.361    0.837      0.801    0.819      0.428    0.823     0.916     very_low
                 0.267    0.119    0.345      0.267    0.301      0.164    0.645     0.315     low
                 0.192    0.070    0.161      0.192    0.175      0.112    0.650     0.119     normal
                 0.111    0.067    0.037      0.111    0.056      0.026    0.515     0.029     high
                 0.333    0.031    0.200      0.333    0.250      0.236    0.741     0.152     very_high
Weighted Avg.    0.633    0.282    0.666      0.633    0.648      0.344    0.769     0.712

=== Confusion Matrix ===

   a   b   c   d   e   <-- classified as
 221  28  12   9   6 |   a = very_low
  34  20   9  11   1 |   b = low
   7   8   5   3   3 |   c = normal
   1   2   3   1   2 |   d = high
   1   0   2   3   3 |   e = very_high

``````

## Naive Bayes

``````

=== Summary ===

Correctly Classified Instances         273               69.1139 %
Incorrectly Classified Instances       122               30.8861 %
Kappa statistic                          0.3038
Mean absolute error                      0.139
Root mean squared error                  0.2945
Relative absolute error                 73.1455 %
Root relative squared error             95.9912 %
Total Number of Instances              395

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.877    0.454    0.818      0.877    0.846      0.448    0.851     0.938     very_low
                 0.253    0.119    0.333      0.253    0.288      0.150    0.741     0.374     low
                 0.346    0.049    0.333      0.346    0.340      0.292    0.823     0.205     normal
                 0.000    0.010    0.000      0.000    0.000      -0.015   0.676     0.053     high
                 0.333    0.021    0.273      0.333    0.300      0.283    0.822     0.173     very_high
Weighted Avg.    0.691    0.344    0.663      0.691    0.675      0.367    0.824     0.745

=== Confusion Matrix ===

   a   b   c   d   e   <-- classified as
 242  27   3   1   3 |   a = very_low
  44  19   9   1   2 |   b = low
   5   8   9   2   2 |   c = normal
   2   2   4   0   1 |   d = high
   3   1   2   0   3 |   e = very_high

``````

## IBk

``````

=== Summary ===

Correctly Classified Instances         247               62.5316 %
Incorrectly Classified Instances       148               37.4684 %
Kappa statistic                          0.152
Mean absolute error                      0.154
Root mean squared error                  0.3856
Relative absolute error                 81.0449 %
Root relative squared error            125.7104 %
Total Number of Instances              395

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.808    0.613    0.753      0.808    0.780      0.206    0.591     0.747     very_low
                 0.213    0.144    0.258      0.213    0.234      0.075    0.509     0.200     low
                 0.269    0.046    0.292      0.269    0.280      0.232    0.610     0.125     normal
                 0.000    0.016    0.000      0.000    0.000      -0.019   0.541     0.025     high
                 0.111    0.016    0.143      0.111    0.125      0.108    0.565     0.038     very_high
Weighted Avg.    0.625    0.460    0.598      0.625    0.610      0.175    0.575     0.570

=== Confusion Matrix ===

   a   b   c   d   e   <-- classified as
 223  36  11   5   1 |   a = very_low
  50  16   5   0   4 |   b = low
  13   5   7   1   0 |   c = normal
   5   2   1   0   1 |   d = high
   5   3   0   0   1 |   e = very_high

``````

## SMO

``````
=== Summary ===

Correctly Classified Instances         273               69.1139 %
Incorrectly Classified Instances       122               30.8861 %
Kappa statistic                          0.2821
Mean absolute error                      0.2607
Root mean squared error                  0.3462
Relative absolute error                137.1955 %
Root relative squared error            112.8557 %
Total Number of Instances              395

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.877    0.513    0.799      0.877    0.836      0.395    0.724     0.811     very_low
                 0.333    0.122    0.391      0.333    0.360      0.225    0.602     0.256     low
                 0.077    0.030    0.154      0.077    0.103      0.065    0.590     0.096     normal
                 0.111    0.010    0.200      0.111    0.143      0.134    0.293     0.021     high
                 0.333    0.018    0.300      0.333    0.316      0.299    0.948     0.245     very_high
Weighted Avg.    0.691    0.384    0.654      0.691    0.670      0.333    0.687     0.628

=== Confusion Matrix ===

   a   b   c   d   e   <-- classified as
 242  30   4   0   0 |   a = very_low
  42  25   4   0   4 |   b = low
  13   7   2   2   2 |   c = normal
   4   2   1   1   1 |   d = high
   2   0   2   2   3 |   e = very_high

``````

## REPTree

``````
=== Summary ===

Correctly Classified Instances         273               69.1139 %
Incorrectly Classified Instances       122               30.8861 %
Kappa statistic                          0.2624
Mean absolute error                      0.1437
Root mean squared error                  0.2836
Relative absolute error                 75.6314 %
Root relative squared error             92.4626 %
Total Number of Instances              395

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.899    0.538    0.795      0.899    0.844      0.406    0.824     0.898     very_low
                 0.253    0.119    0.333      0.253    0.288      0.150    0.694     0.291     low
                 0.000    0.011    0.000      0.000    0.000      -0.027   0.718     0.139     normal
                 0.000    0.003    0.000      0.000    0.000      -0.008   0.715     0.108     high
                 0.667    0.039    0.286      0.667    0.400      0.418    0.973     0.300     very_high
Weighted Avg.    0.691    0.400    0.625      0.691    0.653      0.320    0.793     0.701

=== Confusion Matrix ===

   a   b   c   d   e   <-- classified as
 248  24   1   1   2 |   a = very_low
  51  19   1   0   4 |   b = low
  10  10   0   0   6 |   c = normal
   3   3   0   0   3 |   d = high
   0   1   2   0   6 |   e = very_high

``````

## RandomForest

``````
=== Summary ===

Correctly Classified Instances         275               69.6203 %
Incorrectly Classified Instances       120               30.3797 %
Kappa statistic                          0.1445
Mean absolute error                      0.1553
Root mean squared error                  0.2764
Relative absolute error                 81.7587 %
Root relative squared error             90.1143 %
Total Number of Instances              395

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.949    0.790    0.736      0.949    0.829      0.245    0.840     0.930     very_low
                 0.147    0.059    0.367      0.147    0.210      0.129    0.759     0.393     low
                 0.000    0.005    0.000      0.000    0.000      -0.019   0.815     0.167     normal
                 0.000    0.003    0.000      0.000    0.000      -0.008   0.557     0.204     high
                 0.222    0.010    0.333      0.222    0.267      0.258    0.968     0.402     very_high
Weighted Avg.    0.696    0.564    0.591      0.696    0.625      0.200    0.819     0.749

=== Confusion Matrix ===

   a   b   c   d   e   <-- classified as
 262  14   0   0   0 |   a = very_low
  60  11   1   0   3 |   b = low
  23   2   0   1   0 |   c = normal
   5   2   1   0   1 |   d = high
   6   1   0   0   2 |   e = very_high

``````

## Summary

There are three things to note in the performance summary for classification algorithms:

• Classification accuracy:
This is the ratio of the number of correct predictions to all predictions made, often presented as a percentage, where 100% is the best an algorithm can achieve.
• Accuracy by class:
The `true-positive` and `false-positive` rates of the predictions for each class.
• Confusion matrix:
A table showing the number of predictions for each class compared to the number of instances that actually belong to each class.
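All three quantities can be derived from the confusion matrix alone. The sketch below recomputes them for the Logistic run, using the matrix copied from its output above (rows are actual classes, columns are predicted classes). The function name `summarize` is made up for the example, and the kappa formula is the standard Cohen's kappa, which is assumed here to be what Weka reports as "Kappa statistic".

```python
# Confusion matrix from the Logistic run (rows = actual, columns = predicted).
LABELS = ["very_low", "low", "normal", "high", "very_high"]
MATRIX = [
    [221, 28, 12, 9, 6],
    [34, 20, 9, 11, 1],
    [7, 8, 5, 3, 3],
    [1, 2, 3, 1, 2],
    [1, 0, 2, 3, 3],
]


def summarize(matrix):
    """Return accuracy, Cohen's kappa, and per-class TP and FP rates."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_sums = [sum(row) for row in matrix]                      # actual counts
    col_sums = [sum(matrix[r][c] for r in range(n)) for c in range(n)]  # predicted counts

    correct = sum(matrix[i][i] for i in range(n))
    accuracy = correct / total

    # Agreement expected by chance, from the row and column totals.
    expected = sum(row_sums[i] * col_sums[i] for i in range(n)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)

    # TP rate: correct predictions of a class / instances of that class.
    tp_rates = [matrix[i][i] / row_sums[i] for i in range(n)]
    # FP rate: wrong predictions of a class / instances of all other classes.
    fp_rates = [(col_sums[i] - matrix[i][i]) / (total - row_sums[i]) for i in range(n)]
    return accuracy, kappa, tp_rates, fp_rates


accuracy, kappa, tp_rates, fp_rates = summarize(MATRIX)
print("accuracy: %.4f %%" % (100 * accuracy))
print("kappa:    %.4f" % kappa)
for label, tp, fp in zip(LABELS, tp_rates, fp_rates):
    print("%-10s TP rate %.3f  FP rate %.3f" % (label, tp, fp))
```

For this matrix the script gives an accuracy of 63.2911 % and a kappa of 0.2622, with a TP rate of 0.801 and an FP rate of 0.361 for `very_low`, matching the figures in the Logistic output.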
