Sampling using Resample

Sampling using Resample

lindsp
I am using WEKA for class imbalance problems and as a warmup I need to use
WEKA to:
                    1. Oversample
                    2. Undersample
                    3. SMOTE
                    4. ROSE


I understand what these sampling methods are; I just don't understand how to
configure Resample so that it performs them. If someone knows what each of
the Resample options should be set to for each method, I would greatly
appreciate it. The options are:
         1. biasToUniformClass
         2. debug (true/false)
         3. doNotCheckCapabilities (true/false)
         4. invertSelection (true/false)
         5. noReplacement (true/false)
         6. randomSeed
         7. sampleSizePercent

Thank you!



--
Sent from: http://weka.8497.n7.nabble.com/
_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
List info and subscription status: https://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html
Re: Sampling using Resample

Eibe Frank-2
Administrator
SMOTE is implemented as a separate filter in WEKA. It is not part of Resample. I am not familiar with ROSE.

weka.filters.supervised.instance.Resample uses the following expression to determine the number of instances to sample for a particular class i:

int sampleSize = (int)((m_SampleSizePercent / 100.0) * ((1 - m_BiasToUniformClass) * numInstancesPerClass[i] +
  m_BiasToUniformClass * data.numInstances() / numActualClasses));

where data.numInstances() gives the total number of instances in the dataset, numInstancesPerClass[i] holds the number of instances in class i and numActualClasses corresponds to the number of classes that actually occur in the dataset (some classes declared in an ARFF file may not have any instances in the data).
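
To see how the bias parameter moves the per-class sample size between "keep the original class distribution" (bias 0) and "same size for every class" (bias 1), the expression above can be evaluated outside WEKA. The following is only a sketch: the class name ResampleSizeSketch and its method are made up for illustration, and the class counts (500 and 268 out of 768) are assumed to be those of the diabetes data.

```java
// Standalone restatement of the expression quoted above; it does not use WEKA.
final class ResampleSizeSketch {

    /** Same arithmetic as the quoted expression; the (int) cast truncates toward zero. */
    static int sampleSize(double sampleSizePercent, double biasToUniformClass,
                          int classCount, int totalInstances, int numActualClasses) {
        return (int) ((sampleSizePercent / 100.0)
                * ((1 - biasToUniformClass) * classCount
                   + biasToUniformClass * totalInstances / numActualClasses));
    }

    public static void main(String[] args) {
        int[] classCounts = {500, 268}; // assumed class counts of the diabetes data
        int total = 768;
        for (double bias : new double[] {0.0, 0.5, 1.0}) {
            for (int count : classCounts) {
                // sampleSizePercent is fixed at 100 so that only the bias varies
                System.out.printf("bias=%.1f classCount=%d -> sampleSize=%d%n",
                        bias, count, sampleSize(100.0, bias, count, total, classCounts.length));
            }
        }
    }
}
```

With bias 0 each class keeps its own count; with bias 1 both classes are asked for 768 / 2 = 384 instances, and intermediate values interpolate between the two.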

Assuming you have only two classes, you can do the following.

To undersample the majority class so that both classes have the same number of instances, use noReplacement=true, biasToUniformClass=1.0, and sampleSizePercent=X, where X/2 is (approximately) the percentage of data that belongs to the minority class.

For example, on the diabetes data that comes with WEKA, you can use the following configuration:

  weka.filters.supervised.instance.Resample -B 1.0 -Z 69.8 -no-replacement

You will probably need to fiddle with the -Z parameter (sampleSizePercent) a bit to keep all the instances of the minority class. Watch out for something like "WARNING: Not enough instances of tested_positive for selected value of bias parameter in supervised Resample filter when sampling without replacement." It means the value specified by -Z is too large.
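
Instead of fiddling by hand, a candidate -Z value can be searched for with the same expression. This is a rough sketch, not WEKA code: the class and method names are hypothetical, it assumes biasToUniformClass=1.0, it assumes the diabetes counts (268 minority instances out of 768), and it assumes the warning appears when the requested size exceeds the number of instances in a class.

```java
// Searches for a sampleSizePercent value in steps of 0.1; it does not use WEKA.
final class ResamplePercentSketch {

    /** Per-class sample size for biasToUniformClass = 1.0, following the quoted expression. */
    static int uniformSampleSize(double sampleSizePercent, int totalInstances, int numActualClasses) {
        return (int) ((sampleSizePercent / 100.0) * (1.0 * totalInstances / numActualClasses));
    }

    /** Smallest percentage, in steps of 0.1, whose per-class sample size reaches targetCount. */
    static double smallestPercentFor(int targetCount, int totalInstances, int numActualClasses) {
        for (int tenths = 1; tenths <= 100_000; tenths++) {
            double percent = tenths / 10.0;
            if (uniformSampleSize(percent, totalInstances, numActualClasses) >= targetCount) {
                return percent;
            }
        }
        throw new IllegalArgumentException("no percentage up to 10000 reaches " + targetCount);
    }

    public static void main(String[] args) {
        int total = 768;     // assumed size of the diabetes data
        int minority = 268;  // assumed number of tested_positive instances
        double z = smallestPercentFor(minority, total, 2);
        System.out.println("-Z " + z + " requests "
                + uniformSampleSize(z, total, 2) + " instances per class");
    }
}
```

For these counts the search lands on the 69.8 used above: 69.7 would request only 267 instances per class, and 70.1 would request 269, one more than the minority class has. On much larger datasets a step of 0.1 can skip past the target, so a finer step may be needed.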

A much easier way to achieve the same effect is to use the SpreadSubsample filter instead, with distributionSpread=1.0:

  weka.filters.supervised.instance.SpreadSubsample -M 1.0

To oversample the minority class so that both classes have the same number of instances, use the supervised Resample filter with noReplacement=false, biasToUniformClass=1.0, and sampleSizePercent=Y, where Y/2 is (approximately) the percentage of data that belongs to the majority class. Example for the diabetes data:

  weka.filters.supervised.instance.Resample -B 1.0 -Z 130.3

Note that this will apply sampling *with* replacement to the majority class as well, so it may not be ideal for your application! To get oversampling of the minority class and keep the majority class untouched, you may need to write your own program or use the KnowledgeFlow.
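
If you do write your own program, the core of it could look roughly like the sketch below. It is plain Java and deliberately avoids the WEKA API: rows are represented by a generic type T, the two classes are assumed to have been separated already, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Oversamples only the minority class; the majority class is copied unchanged.
final class MinorityOversampleSketch {

    /**
     * Returns every majority row once, every minority row once, and then random
     * duplicates of minority rows (drawn with replacement) until both classes
     * have the same number of rows.
     */
    static <T> List<T> oversampleMinority(List<T> majority, List<T> minority, long seed) {
        if (minority.isEmpty()) {
            throw new IllegalArgumentException("minority class has no rows");
        }
        Random random = new Random(seed);
        List<T> result = new ArrayList<>(majority);
        result.addAll(minority);
        for (int i = minority.size(); i < majority.size(); i++) {
            result.add(minority.get(random.nextInt(minority.size())));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> majority = new ArrayList<>();
        List<String> minority = new ArrayList<>();
        for (int i = 0; i < 500; i++) majority.add("neg-" + i); // stand-ins for majority rows
        for (int i = 0; i < 268; i++) minority.add("pos-" + i); // stand-ins for minority rows
        List<String> balanced = oversampleMinority(majority, minority, 1L);
        System.out.println("rows after oversampling: " + balanced.size());
    }
}
```

Unlike the Resample configuration above, this keeps every original instance of both classes and only adds duplicates of minority instances.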

Cheers,
Eibe

> On 20/10/2018, at 5:21 AM, lindsp <[hidden email]> wrote:
> [...]

Re: Sampling using Resample

David
Eibe Frank-2 wrote
> I am not familiar with ROSE.

Dear Eibe,
  ROSE combines oversampling and undersampling to generate a new sample of
the data. It is described in G. Menardi and N. Torelli, “Training and
assessing classification rules with imbalanced data”
<https://core.ac.uk/download/pdf/41172947.pdf>, Data Mining and Knowledge
Discovery, pp. 1–31, 2014. It would be great if WEKA supported it.



Re: Sampling using Resample

Peter Reutemann

We're always open to contributions! :-)

Cheers, Peter
--
Peter Reutemann
Dept. of Computer Science
University of Waikato, NZ
+64 (7) 858-5174
http://www.cms.waikato.ac.nz/~fracpete/
http://www.data-mining.co.nz/