SMOTE - sample generation

glemaitre
Dear all,

I am maintaining a library in Python implementing SMOTE. We recently got an inquiry from a user regarding the algorithm generating a new instance.

I am unsure whether the gap parameter should be drawn separately for each attribute, or drawn a single time and shared across all attributes. The first option generates a new sample anywhere in the hypercube defined by the two selected samples, while the second only generates a sample on the hyperplane (more precisely, on the line segment joining the two samples).
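
The two options can be sketched as follows. This is a minimal, pure-Python illustration rather than the code of any existing library: the function names and the toy samples are hypothetical, and samples are assumed to be plain lists of numeric attributes.

```python
import random


def interpolate_shared_gap(x, neighbor, rng):
    """Draw one gap and reuse it for every attribute.

    The result lies on the line segment between x and neighbor.
    """
    gap = rng.random()  # single draw in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]


def interpolate_per_attribute_gap(x, neighbor, rng):
    """Draw a fresh gap for each attribute.

    The result lies in the axis-aligned box spanned by x and neighbor.
    """
    return [xi + rng.random() * (ni - xi) for xi, ni in zip(x, neighbor)]


rng = random.Random(0)  # seeded only to make the sketch reproducible
x, neighbor = [0.0, 0.0], [1.0, 2.0]

on_segment = interpolate_shared_gap(x, neighbor, rng)
in_box = interpolate_per_attribute_gap(x, neighbor, rng)

print(on_segment)  # second coordinate is exactly twice the first
print(in_box)      # each coordinate is between x and neighbor, independently
```

With a shared gap the new sample keeps the 1:2 ratio of the segment from (0, 0) to (1, 2); with per-attribute gaps it can fall anywhere in the rectangle [0, 1] x [0, 2].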

The original paper is a bit vague on this point: the pseudo-code seems to support the first option, while the text description ("This causes the selection of a random point along the line segment between two specific features") and the example (Table 1) seem to go for the second.

I was wondering which implementation you went for in WEKA, and whether there is actually a "right" supported solution.

Cheers,
--
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/

_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
List info and subscription status: https://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: SMOTE - sample generation

Eibe Frank-2
Administrator
The SMOTE filter for WEKA was contributed as a package by Ryan Lichtenwalter. Here is the relevant code:

double dif = nnArray[nn].value(attr) - instanceI.value(attr);
double gap = rand.nextDouble();
values[attr.index()] = (double) (instanceI.value(attr) + gap * dif);

So that implementation picks a random gap value separately for each attribute. I would have chosen the second approach. Have you compared the two?
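
In symbols, the excerpt above can be read as follows. The notation is introduced here for illustration only and does not come from the WEKA source: $x_i$ stands for `instanceI`, $x_{nn}$ for the chosen neighbour `nnArray[nn]`, and $j$ indexes the attributes.

```latex
x^{\mathrm{new}}_{j} = x_{i,j} + g_j \,\bigl(x_{nn,j} - x_{i,j}\bigr),
\qquad g_j \sim \mathcal{U}[0, 1) \ \text{drawn independently for each attribute } j
```

The second approach would instead draw a single $g \sim \mathcal{U}[0, 1)$ and use it for every $j$, which places $x^{\mathrm{new}}$ on the line segment between $x_i$ and $x_{nn}$.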

Cheers,
Eibe

> On 20/10/2018, at 5:43 AM, Guillaume Lemaître <[hidden email]> wrote:
> [...]

Re: SMOTE - sample generation

glemaitre
Sorry for the late answer.

After contacting N. Chawla, it seems that his implementation generated
samples in the hypercube. However, he does not recall what the impact of
choosing one method over the other was.

I have therefore started a benchmark on the dataset used in the paper. It is
still in progress, but I'll be happy to share the results:
https://github.com/I2Cvb/smote_exp

On the side, I thought a bit more about this problem, and it seems that
generating samples on the hyperplane (that is, on the line segment joining
the two selected samples) would make much more sense than generating them in
the hypercube. The intuition is that generating on the hyperplane ensures
that the new sample belongs to the same subspace and manifold as the samples
used to generate it. This does not hold if you generate a sample in the
hypercube. Another way to express it is that a sample generated on the
hyperplane is a linear (in fact convex) combination of the selected samples,
while generating a sample in the hypercube does not enforce such a
relationship.
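
The subspace argument can be checked on a toy example. The sketch below uses made-up values, not data from the benchmark: two samples that both satisfy the linear constraint a + b + c = 1, and fixed gap values chosen purely for illustration in place of random draws.

```python
# Two hypothetical minority samples, both satisfying a + b + c = 1.
x = [0.2, 0.3, 0.5]
neighbor = [0.6, 0.1, 0.3]

# Shared gap: one value reused for every attribute.
shared_gap = 0.25
shared = [xi + shared_gap * (ni - xi) for xi, ni in zip(x, neighbor)]

# Per-attribute gaps: a different value for each attribute.
gaps = [0.9, 0.1, 0.5]
per_attribute = [xi + g * (ni - xi) for xi, ni, g in zip(x, neighbor, gaps)]

print(sum(shared))         # stays at 1 up to floating-point rounding
print(sum(per_attribute))  # about 1.24: the constraint is broken
```

The shared-gap sample is (1 - gap) * x + gap * neighbor, so any linear constraint satisfied by both samples is preserved. With per-attribute gaps the attribute differences are scaled unevenly and no longer cancel, so the new sample leaves the subspace.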




Re: SMOTE - sample generation

Eibe Frank-2
Administrator
Thanks for sharing the results. The raw results are a bit hard to look through and compare. You don’t have a summary by any chance, do you?

Cheers,
Eibe

> On 17/11/2018, at 2:30 AM, glemaitre <[hidden email]> wrote:
> [...]

Re: SMOTE - sample generation

glemaitre
Oh, this is still in progress. I'll do a summary at the end.

On Sun, 18 Nov 2018 at 23:47, Eibe Frank <[hidden email]> wrote:
> [...]


--
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/
