Problem with Evaluation of test data when using SMOTE on training data

Problem with Evaluation of test data when using SMOTE on training data

Oskar
I've got a small problem with LibSVM and SMOTE that I hope someone can help me with.

As a hobby project, I have started experimenting a little with Weka and LibSVM using Java. I'm using a classification model and I have two classes which are imbalanced (about 10:1). I get both the training and the test set data from my local MySQL database, via the retrieveInstances() method, not from any ARFF file. I'm currently using SMOTE to balance the classes in the training data set.

But I'm having a surprising amount of trouble getting it to work. I noticed, however, that if both the training set and the test set start with a row where the class (desired value) is the same, then it seems to work. Both data sets have to start with class A, the less common class. The confusion matrix looks normal then, and the areaUnderROC value is good (suspiciously good, even).

Below is part of my code. I simply can't understand why the test data won't work just because it doesn't begin with a certain class. I'm not even using SMOTE on the test data. And I use two separate Evaluation objects, one for the training data and one for the test data.

SMOTE smote = new SMOTE();
smote.setInputFormat(trainingInstances);
smote.setPercentage(percent);
smote.setNearestNeighbors(5);
   
FilteredClassifier fc = new FilteredClassifier();
fc.setClassifier(libsvm);
fc.setFilter(smote);
fc.buildClassifier(trainingInstances);  
       
Evaluation trainEval = new Evaluation(trainingInstances);
trainEval.evaluateModel(libsvm, trainingInstances);

// The evaluation of the test data, which shouldn't use SMOTE.
Evaluation testEval = new Evaluation(testInstances);
testEval.evaluateModel(libsvm, testInstances);

The correlation value and the number of correctly classified rows are always the same in both cases, which is reassuring at least, since I calculate these manually as well. But the methods of the Evaluation class give weird results when the first rows of the training and test data don't have the same desired value.

The value of testEval.areaUnderROC(0) is completely different if I don't start both data sets with the same class in the first row. With a different class in the first row of the test data, the areaUnderROC value even suggests a negative correlation.

Has anyone experienced similar problems, or know the solution to it? What am I doing wrong?


Re: Problem with Evaluation of test data when using SMOTE on training data

Peter Reutemann

You need to evaluate the FilteredClassifier (which wraps LibSVM), not LibSVM.
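
For example, keeping the variable names from your snippet, the evaluation calls would look roughly like this (a minimal, untested sketch):

// Evaluate the wrapped classifier; SMOTE is then only applied to the
// training data inside the FilteredClassifier, never to the test data.
Evaluation trainEval = new Evaluation(trainingInstances);
trainEval.evaluateModel(fc, trainingInstances);

Evaluation testEval = new Evaluation(testInstances);
testEval.evaluateModel(fc, testInstances);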

Cheers, Peter
--
Peter Reutemann
Dept. of Computer Science
University of Waikato, NZ
+64 (7) 858-5174
http://www.cms.waikato.ac.nz/~fracpete/
http://www.data-mining.co.nz/

Re: Problem with Evaluation of test data when using SMOTE on training data

Oskar
I've now tried your suggestion, but unfortunately I'm still seeing the same
problem.

I changed the evaluation of the train instances to:
trainEval.evaluateModel(fc, trainingInstances);

For the test instances, I've tried both with "fc" and "libsvm" in
evaluateModel(), but neither worked.

When both the training data and the test data start with the same class in the
first row, I get this result:
trainPctCorrect: 87.244 , testPctCorrect: 86.633
trainEval.areaUnderROC(0): 0.793 , testEval.areaUnderROC(0): 0.800

But as soon as the test data doesn't start with the same class, I get this
poor result from the same test data:
trainPctCorrect: 87.244 , testPctCorrect: 13.367
trainEval.areaUnderROC(0): 0.793 , testEval.areaUnderROC(0): 0.200

I just can't understand why the evaluation would be affected by which class
the test data starts with.




Re: Problem with Evaluation of test data when using SMOTE on training data

Peter Reutemann

I only looked at your code from an API point of view, as I don't know
what your datasets are.

BTW, drop the following line as well; the FilteredClassifier does that
automatically:
smote.setInputFormat(trainingInstances);

Are the training and test sets actually compatible?

String msg = trainingInstances.equalHeadersMsg(testInstances);
if (msg != null)
  throw new Exception("Incompatible datasets:\n" + msg);

https://weka.sourceforge.io/doc.dev/weka/core/Instances.html#equalHeadersMsg-weka.core.Instances-

Cheers, Peter
--
Peter Reutemann
Dept. of Computer Science
University of Waikato, NZ
+64 (7) 858-5174
http://www.cms.waikato.ac.nz/~fracpete/
http://www.data-mining.co.nz/

Re: Problem with Evaluation of test data when using SMOTE on training data

Oskar
You were right about the datasets being incompatible; I appreciate your help.
Below is the error message I got after adding your code:

Exception in thread "main" java.lang.Exception: Incompatible datasets:
Attributes differ at position 40:
Labels differ at position 1: yes != no

But I didn't know you were supposed to add headers or labels to
training/testing data that comes from a database. With ARFF files you define
headers, of course, but I've never seen this for database data.

Am I supposed to send the label names as the first row of my database data?
It seems as if Weka believes that row one of my data set contains the
labels (column names) instead of just input values.




Re: Problem with Evaluation of test data when using SMOTE on training data

Peter Reutemann

Weka uses indices internally. If the order of your labels changes, the
meaning of the indices changes, of course.
That's why using CSV files or databases is a bit tricky.

You might try wrapping your FilteredClassifier in the InputMappedClassifier:
https://weka.sourceforge.io/doc.dev/weka/classifiers/misc/InputMappedClassifier.html

Haven't used it myself, but that should hopefully sort out the label order.
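
A rough, untested sketch of how that could be wired up, reusing the fc object from your code:

import weka.classifiers.misc.InputMappedClassifier;

// Wrap the FilteredClassifier; incoming test instances are then mapped
// onto the attribute/label structure seen during training.
InputMappedClassifier imc = new InputMappedClassifier();
imc.setClassifier(fc);
imc.buildClassifier(trainingInstances);

Evaluation testEval = new Evaluation(testInstances);
testEval.evaluateModel(imc, testInstances);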

Cheers, Peter
--
Peter Reutemann
Dept. of Computer Science
University of Waikato, NZ
+64 (7) 858-5174
http://www.cms.waikato.ac.nz/~fracpete/
http://www.data-mining.co.nz/

Re: Problem with Evaluation of test data when using SMOTE on training data

Oskar
Thank you for your help. I will try the InputMappedClassifier. But it's a
pity that Weka doesn't seem to have a method where you can manually
set/update the header of an Instances object. Without that, you are at a big
disadvantage if you choose to get your Instances from a database instead of
from an ARFF file.

Perhaps I will eventually be forced to create ARFF files on the fly from my
database data. But that feels unnecessary, because I already have the data
in Instances objects.

BTW, is there a method that displays the complete header/structure for a
given dataset? (For an Instances object, I mean.) Then I can at least see
how Weka is viewing my training and test set respectively, and hopefully see
exactly where the difference is.





Re: Problem with Evaluation of test data when using SMOTE on training data

Peter Reutemann
> Thank you for your help. I will try the InputMappedClassifier. But it's a
> pity that Weka doesn't seem to have a method where you can manually
> set/update the header of an Instances object. Without that, you are at a big
> disadvantage if you choose to get your Instances from a database instead of
> from an ARFF file.
>
> Perhaps I will eventually be forced to create ARFF files on the fly from my
> database data. But that feels unnecessary, because I already have the data
> in Instances objects.

If you already have a dataset structure, you can then just
transfer the data from the database (e.g. obtained via JDBC queries or
via the InstancesQuery class) into new Instance (dense or sparse)
objects. The following example shows how to generate
Instances/Instance objects on the fly:
https://waikato.github.io/weka-wiki/formats_and_processing/creating_arff_file/
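
As a rough, untested illustration (assuming you already have an Instances object named structure describing your columns, and a JDBC ResultSet named rs with matching columns; both names and the column layout are just placeholders):

import java.sql.ResultSet;
import weka.core.DenseInstance;
import weka.core.Instances;

// Empty dataset with the same header as the existing structure.
Instances data = new Instances(structure, 0);

while (rs.next()) {
  double[] vals = new double[data.numAttributes()];
  vals[0] = rs.getDouble(1);                                 // numeric column
  vals[1] = data.attribute(1).indexOfValue(rs.getString(2)); // nominal column, stored as the label's index
  data.add(new DenseInstance(1.0, vals));
}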

> BTW, is there a method that displays the complete header/structure for a
> given dataset? (For an Instances object, I mean.) Then I can at least see
> how Weka is viewing my training and test set respectively, and hopefully see
> exactly where the difference is.

Simply call toString() on an Instances object. If you don't want the
data to be printed, create an empty copy of the dataset first:
Instances mydata = ...  // your dataset
Instances structure = new Instances(mydata, 0);
System.out.println(structure);

Cheers, Peter
--
Peter Reutemann
Dept. of Computer Science
University of Waikato, NZ
+64 (7) 858-5174
http://www.cms.waikato.ac.nz/~fracpete/
http://www.data-mining.co.nz/

Re: Problem with Evaluation of test data when using SMOTE on training data

Oskar
I finally got it to work. I appreciate the help I received. Here is the cause
of the error (as far as I can tell) and what I did to work around it, in
case anyone wants to know.

I printed out the headers of my training data and test data for comparison.
They are identical, except that for the class attribute (the desired value),
which is a nominal attribute, the values are listed in reverse order.

For example like this for the training data:
@attribute the_result {yes,no}

and like this for the test data:
@attribute the_result {no,yes}

The reason for this is that the nominal values for that attribute appear in
that order in the data; they just happen to be in that order in my
database. The retrieveInstances() method creates the dataset header
automatically based on the data in the database. Often that turns out fine,
at least for regression data.

But with nominal attributes it can cause a problem. In Weka's eyes, my
datasets are incompatible simply because the values appear in a different order.

I created my own version of the retrieveInstances() method, in which I manually
define the header of the Instances object exactly the way I want it,
regardless of the order of the rows. It now works when I use this homemade
method instead of retrieveInstances().
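
In essence, the fix boils down to declaring the class attribute with its labels in a fixed order, something like this (simplified sketch; the rest of the method just fills in the rows from the query):

import java.util.ArrayList;
import weka.core.Attribute;

// Always declare the class labels in the same fixed order, instead of
// letting the order depend on which value appears first in the query result.
ArrayList<String> labels = new ArrayList<>();
labels.add("yes");
labels.add("no");
Attribute classAttr = new Attribute("the_result", labels);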


