Choosing Between ClassBalancer, Resample, and SpreadSubsample Filters


Jeff Pattillo
I work with healthcare data and the specific problem I am working on right now has very unbalanced classes (roughly 10:1).  I tried including 10 copies of the smaller class for every 1 instance of the bigger class, but the classifier that resulted did not generalize very well.  The model seems to be picking up on features of the smaller class because it was artificially enlarged in such a uniform way.  I am thinking I might get better results by reducing the size of the larger class via sampling.

Does anyone have extensive experience in this?  Is this the right way to go?

I was looking at the three supervised filters ClassBalancer, Resample, and SpreadSubsample, and it seems I can get classes of equal size with all three: ClassBalancer does it automatically, Resample does it if you set "biasToUniformClass=1.0", and SpreadSubsample does it if you set "distributionSpread=1.0".  With ClassBalancer and SpreadSubsample you get slightly odd results.  If you run ClassBalancer on diabetes.arff, the per-class counts differ even though the per-class weights are the same.  If you run SpreadSubsample on diabetes.arff with both "AdjustWeights=True" and "distributionSpread=1.0", the per-class counts are the same but the weights differ.

Which of these samplers do you recommend?  What does it mean to have different counts but equal weights for the class?  What does it mean to have equal counts but different weights for the class?

One final question.  I've seen it recommended to repeatedly sample the data and build a classifier, and ultimately create a hybrid classifier made up of all the classifiers built from the samples.  Is there a way to create such a classifier in WEKA using a metaclassifier?

Thanks for the help as always!

Jeff

_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
List info and subscription status: http://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html
Re: Choosing Between ClassBalancer, Resample, and SpreadSubsample Filters

Martin
Each of the filters you mentioned is fine as long as you use it in conjunction with "FilteredClassifier". Which one performs best depends on the characteristics of your data.


If you want to use ensemble learning, you can select, e.g., Bagging for this purpose, in conjunction with FilteredClassifier and any of the filters you mentioned.
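The repeated-sampling scheme Jeff asked about (build a classifier per sample, then combine them) is what Bagging plus FilteredClassifier gives you inside WEKA. As a language-neutral illustration of the idea only, here is a toy Python sketch of drawing class-balanced bags and combining predictions by voting; all function names are hypothetical, and this is not the Weka API:

```python
import random
from collections import Counter

def balanced_bag(data, labels, rng):
    """Draw one bag: undersample every class down to the minority-class size."""
    by_class = {}
    for x, y in zip(data, labels):
        by_class.setdefault(y, []).append(x)
    n = min(len(xs) for xs in by_class.values())
    bag = []
    for y, xs in by_class.items():
        bag.extend((x, y) for x in rng.sample(xs, n))
    return bag

def majority_vote(votes):
    """Combine the predictions of the per-bag classifiers by simple voting."""
    return Counter(votes).most_common(1)[0][0]
```

In WEKA itself you would instead configure Bagging with FilteredClassifier as its base learner and one of the three filters (e.g. SpreadSubsample) inside it.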

Regards,
Martin


On 6 January 2016 at 23:22, Jeff Pattillo <[hidden email]> wrote:
Re: Choosing Between ClassBalancer, Resample, and SpreadSubsample Filters

Eibe Frank-2
Administrator
In reply to this post by Jeff Pattillo
In WEKA, each instance in a dataset can have a weight. All learning algorithms that make use of instance weights, when they are provided, implement the WeightedInstancesHandler interface (the relevant ones in the core distribution for WEKA 3.7 can be seen in the Javadoc at http://weka.sourceforge.net/doc.dev/weka/core/WeightedInstancesHandler.html). For example, instead of duplicating an instance, you can simply give it the weight two. This *should* have the same (or approximately the same) effect if the learning algorithm is a WeightedInstancesHandler.
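To illustrate the point about weights with a toy Python sketch (not Weka code): for a learner that respects instance weights, giving an instance weight two should have the same effect as duplicating it, as the weighted mean shows.

```python
def weighted_mean(values, weights):
    """Mean in which each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Duplicating the instance 5.0 ...
mean_duplicated = (1.0 + 5.0 + 5.0) / 3
# ... versus giving it weight two instead:
mean_weighted = weighted_mean([1.0, 5.0], [1.0, 2.0])
assert abs(mean_duplicated - mean_weighted) < 1e-12
```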

ClassBalancer simply reweights the instances so that the sum of weights for all classes of instances in the data is the same. No instances are deleted or added, so the count for each class remains unchanged.
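That reweighting rule can be sketched in a few lines of toy Python (the function name is mine, not Weka's API, and I am assuming the overall weight sum is kept equal to the number of instances):

```python
from collections import Counter

def class_balancer_weights(labels):
    """Assign each instance a weight so every class has the same total
    weight, while the overall sum of weights stays len(labels).
    No instances are added or removed, so the counts are unchanged."""
    counts = Counter(labels)
    share = len(labels) / len(counts)       # equal total weight per class
    return [share / counts[y] for y in labels]
```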

In contrast, SpreadSubsample, when "AdjustWeights=true", after it has resampled the data to achieve the desired spread, modifies the instance weights so that the "total weight per class is maintained" (i.e., is the same as in the original dataset).
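A toy Python sketch of that behaviour, under the same caveats (hypothetical names, not the Weka API): subsample every class down to the minority size, then weight each kept instance so its class's total weight equals the original class count.

```python
import random
from collections import Counter

def spread_subsample_adjust(labels, rng):
    """Keep a minority-size sample of each class (by index) and weight each
    kept instance so its class's total weight equals the original count."""
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    n = min(len(idxs) for idxs in by_class.values())
    kept, weights = [], []
    for y, idxs in by_class.items():
        for i in rng.sample(idxs, n):
            kept.append(i)
            weights.append(len(idxs) / n)   # maintain total weight per class
    return kept, weights
```

On a 100:10 dataset this keeps 10 instances per class, with the majority instances weighted 10.0 and the minority instances weighted 1.0, which matches the "equal counts but different weights" output Jeff observed.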

Cheers,
Eibe

> On 7 Jan 2016, at 04:22, Jeff Pattillo <[hidden email]> wrote:

Re: SMOTE filter in weka

ajmal
In reply to this post by Jeff Pattillo
Sir
Could you please explain how the SMOTE filter actually works?


