I'm using the WEKA API in Java to apply sampling methods to a dataset. Two of
the methods I need are oversampling and undersampling. I know I need to use
the Resample filter, but I am confused about what each option of the filter
needs to be set to in order to perform them.
Would anyone be able to advise me on using the Resample filter for
oversampling and undersampling?
In the supervised Resample filter, the number of instances sampled for each class is computed from three quantities: data.numInstances() gives the total number of instances in the dataset, numInstancesPerClass[i] holds the number of instances in class i, and numActualClasses is the number of classes that actually occur in the dataset (some classes declared in an ARFF file may not have any instances in the data).
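To make these quantities concrete, here is a minimal standalone sketch of how they appear to combine into a per-class sample size. The formula is my reading of the supervised Resample filter and should be checked against the source of your WEKA version; the class and method names are made up for illustration.

```java
public class ResampleSizeSketch {

    /**
     * Assumed per-class sample size: a blend of the class's own size
     * (bias = 0) and an equal share of the whole dataset (bias = 1),
     * scaled by sampleSizePercent and truncated to an int.
     */
    public static int sampleSizeForClass(double sampleSizePercent,
                                         double biasToUniformClass,
                                         int numInstancesInClass,
                                         int numInstances,
                                         int numActualClasses) {
        double proportionalPart = (1 - biasToUniformClass) * numInstancesInClass;
        double uniformPart = biasToUniformClass * numInstances / numActualClasses;
        return (int) ((sampleSizePercent / 100.0) * (proportionalPart + uniformPart));
    }

    public static void main(String[] args) {
        // diabetes data: 768 instances, 268 of them tested_positive
        System.out.println(sampleSizeForClass(100.0, 0.0, 268, 768, 2)); // class keeps its own size
        System.out.println(sampleSizeForClass(100.0, 1.0, 268, 768, 2)); // equal share of all instances
    }
}
```

With biasToUniformClass=1.0 every class gets the same target size, which is why sampleSizePercent alone decides how many instances each class ends up with in the two-class recipes below.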
Assuming you have only two classes, you can do the following.
To undersample the majority class so that both classes have the same number of instances, use the supervised Resample filter with noReplacement=true, biasToUniformClass=1.0, and sampleSizePercent=X, where X/2 is (approximately) the percentage of data that belongs to the minority class.
For example, on the diabetes data that comes with WEKA, you can use the following configuration:
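Through the Java API, those settings might look like the sketch below. The file path argument, the class name and the value 69.8 are my assumptions: the diabetes data has 268 tested_positive instances out of 768 (about 34.9%), so X comes out at roughly 69.8. The sketch also assumes the class is the last attribute.

```java
import java.util.Arrays;

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.Resample;

public class UndersampleWithResample {

    /** Undersamples via the supervised Resample filter; the class index must already be set. */
    public static Instances undersample(Instances data, double sampleSizePercent) throws Exception {
        Resample filter = new Resample();
        filter.setNoReplacement(true);                   // -no-replacement
        filter.setBiasToUniformClass(1.0);               // -B 1.0
        filter.setSampleSizePercent(sampleSizePercent);  // -Z
        filter.setRandomSeed(1);                         // -S 1
        filter.setInputFormat(data);                     // call after the options are set
        return Filter.useFilter(data, filter);
    }

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read(args[0]);       // e.g. the path to diabetes.arff
        data.setClassIndex(data.numAttributes() - 1);    // assumes the class is the last attribute
        Instances balanced = undersample(data, 69.8);    // 69.8 is an estimate, adjust as needed
        int[] counts = balanced.attributeStats(balanced.classIndex()).nominalCounts;
        System.out.println(Arrays.toString(counts));     // instances per class after filtering
    }
}
```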
You will probably need to fiddle with the -Z parameter (sampleSizePercent) a bit to keep all the instances of the minority class. Watch out for something like "WARNING: Not enough instances of tested_positive for selected value of bias parameter in supervised Resample filter when sampling without replacement." It means the value specified by -Z is too large.
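The trial and error can be shortened with a little arithmetic. The sketch below assumes that, for two classes and biasToUniformClass=1.0, the filter truncates (Z/100) * N/2 to an int to get the per-class sample size. That is my reading of the filter rather than documented behaviour, and the class and method names are hypothetical. Aiming for the middle of the range of Z values that truncate to the wanted class size avoids landing one instance short or one over.

```java
public class SampleSizePercentSketch {

    /** Assumed per-class sample size for two classes with biasToUniformClass = 1.0. */
    public static int perClassSize(double sampleSizePercent, int numInstances) {
        return (int) ((sampleSizePercent / 100.0) * (numInstances / 2.0));
    }

    /**
     * A -Z value that should truncate to exactly targetClassSize per class.
     * The extra 0.5 puts the value in the middle of the acceptable range.
     */
    public static double percentFor(int targetClassSize, int numInstances) {
        return 200.0 * (targetClassSize + 0.5) / numInstances;
    }

    public static void main(String[] args) {
        // diabetes data: 768 instances, 268 tested_positive, 500 tested_negative
        double z = percentFor(268, 768);
        System.out.println("-Z " + z + " gives " + perClassSize(z, 768) + " per class");
    }
}
```

Passing the majority class size instead (500 for the diabetes data) gives a starting value for the oversampling case.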
A much easier way to achieve the same effect is to use the SpreadSubsample filter instead, with distributionSpread=1.0:
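A hedged sketch of the SpreadSubsample route through the Java API follows. The class name and the file path argument are assumptions, and the class is again assumed to be the last attribute.

```java
import java.util.Arrays;

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.SpreadSubsample;

public class UndersampleWithSpreadSubsample {

    /** Undersamples so that no class is larger than the rarest one; class index must be set. */
    public static Instances undersample(Instances data) throws Exception {
        SpreadSubsample filter = new SpreadSubsample();
        filter.setDistributionSpread(1.0);   // -M 1.0: at most a 1:1 ratio between classes
        filter.setRandomSeed(1);             // -S 1
        filter.setInputFormat(data);         // call after the options are set
        return Filter.useFilter(data, filter);
    }

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read(args[0]);      // e.g. the path to diabetes.arff
        data.setClassIndex(data.numAttributes() - 1);   // assumes the class is the last attribute
        Instances balanced = undersample(data);
        int[] counts = balanced.attributeStats(balanced.classIndex()).nominalCounts;
        System.out.println(Arrays.toString(counts));    // instances per class after filtering
    }
}
```

There is no percentage to tune here, which is what makes this filter the easier choice for plain undersampling.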
To oversample the minority class so that both classes have the same number of instances, use the supervised Resample filter with noReplacement=false, biasToUniformClass=1.0, and sampleSizePercent=Y, where Y/2 is (approximately) the percentage of data that belongs to the majority class. Example for the diabetes data:
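In Java this might look like the sketch below. The value 130.3 is my own estimate, not a tested setting: 500 of the 768 diabetes instances are tested_negative (about 65.1%), so Y is roughly 130.2, rounded up slightly so that the per-class size does not fall one short. The class name and file path argument are assumptions too.

```java
import java.util.Arrays;

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.Resample;

public class OversampleWithResample {

    /** Oversamples via the supervised Resample filter; the class index must already be set. */
    public static Instances oversample(Instances data, double sampleSizePercent) throws Exception {
        Resample filter = new Resample();
        filter.setNoReplacement(false);                  // sample with replacement
        filter.setBiasToUniformClass(1.0);               // -B 1.0
        filter.setSampleSizePercent(sampleSizePercent);  // -Z
        filter.setRandomSeed(1);                         // -S 1
        filter.setInputFormat(data);                     // call after the options are set
        return Filter.useFilter(data, filter);
    }

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read(args[0]);       // e.g. the path to diabetes.arff
        data.setClassIndex(data.numAttributes() - 1);    // assumes the class is the last attribute
        Instances balanced = oversample(data, 130.3);    // 130.3 is an estimate, adjust as needed
        int[] counts = balanced.attributeStats(balanced.classIndex()).nominalCounts;
        System.out.println(Arrays.toString(counts));     // instances per class after filtering
    }
}
```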
Note that this will apply sampling *with* replacement to the majority class as well, so it may not be ideal for your application! To get oversampling of the minority class and keep the majority class untouched, you may need to write your own program or use the KnowledgeFlow.
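If you do write your own program, the core logic is small. Here is a minimal sketch in plain Java, independent of WEKA, that keeps every majority and minority item and tops the minority class up with random duplicates. The class and method names are hypothetical, and it assumes you have already split your data into the two classes; with WEKA you would apply the same loop to the rows of an Instances object.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class MinorityOversampleSketch {

    /**
     * Returns the majority items unchanged, followed by all minority items,
     * followed by randomly chosen minority duplicates until both classes
     * contribute the same number of items.
     */
    public static <T> List<T> oversampleMinority(List<T> majority, List<T> minority, long seed) {
        if (minority.isEmpty()) {
            throw new IllegalArgumentException("minority class has no instances");
        }
        Random random = new Random(seed);
        List<T> result = new ArrayList<>(majority);  // majority class is left untouched
        result.addAll(minority);                     // every original minority item is kept once
        for (int i = minority.size(); i < majority.size(); i++) {
            // draw with replacement from the minority class only
            result.add(minority.get(random.nextInt(minority.size())));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> majority = List.of("n1", "n2", "n3", "n4", "n5");
        List<String> minority = List.of("p1", "p2");
        System.out.println(oversampleMinority(majority, minority, 1L));
    }
}
```

Unlike the Resample configuration, this never drops or duplicates a majority instance, at the cost of not going through the filter framework.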