Balancing datasets

Balancing datasets

asadbtk
Hi Eibe and Peter

I have a dataset of 10,000 instances (2,000 True and 8,000 False). I want to select a subset of 300 instances that contains 60 True and 240 False.

My question: is there an automatic way of selecting this subset, or do we have to do it manually, which I think would be time-consuming?

Thanks 

_______________________________________________
Wekalist mailing list -- [hidden email]
Send posts to [hidden email]
To unsubscribe send an email to [hidden email]
To subscribe, unsubscribe, etc., visit https://list.waikato.ac.nz/postorius/lists/wekalist.list.waikato.ac.nz
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: Balancing datasets

Peter Reutemann-3
You could try the supervised Resample filter:

https://weka.sourceforge.io/doc.dev/weka/filters/supervised/instance/Resample.html

Cheers, Peter

On August 1, 2020 8:08:47 PM GMT+12:00, javed khan <[hidden email]> wrote:

>Hi Eibe and Peter
>
>I have a dataset with total observations/instances of 10,000 (i.e. 2000
>True values and 8000 False values). I want to select only a subset of
>observations so that we have 60 True values observations/instances and
>240
>False values.
>
>My question is there any automatic/sophisticated way of selecting this
>subset or do we have to do it manually, which I think would be a time
>consuming process.
>
>Thanks

--
Peter Reutemann
Dept. of Computer Science
University of Waikato, NZ
+64 (7) 858-5174
http://www.cms.waikato.ac.nz/~fracpete/
http://www.data-mining.co.nz/

Re: Balancing datasets

asadbtk
Hi Peter and thanks for your reply.

How can we select a fixed number of True and False instances with the Resample filter? I do not see any option for that; am I wrong?

Best regards
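
For readers who want to see what "a fixed number per class" means operationally, here is a minimal sketch in plain Python. It is not Weka code and does not use the Resample filter; the function name `sample_fixed_counts`, the `seed` parameter, and the synthetic label list are all hypothetical and exist only for illustration. It draws the requested number of row indices from each class without replacement.

```python
import random
from collections import defaultdict


def sample_fixed_counts(labels, wanted, seed=1):
    """Pick exactly wanted[label] row indices per class, without replacement.

    Hypothetical helper for illustration only; not part of Weka.
    """
    rng = random.Random(seed)

    # Group row indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    chosen = []
    for label, n in wanted.items():
        # random.sample raises ValueError if the class has fewer than n rows.
        chosen.extend(rng.sample(by_class[label], n))
    return sorted(chosen)


# Synthetic stand-in for the dataset in the question: 2000 True, 8000 False.
labels = [True] * 2000 + [False] * 8000
subset = sample_fixed_counts(labels, {True: 60, False: 240})
num_true = sum(labels[i] for i in subset)
print(len(subset), num_true)
```

The returned indices can then be used to select rows from whatever table or file holds the data.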

On Fri, Jul 31, 2020 at 5:14 PM Peter Reutemann <[hidden email]> wrote:
You could try the supervised Resample filter:

https://weka.sourceforge.io/doc.dev/weka/filters/supervised/instance/Resample.html

Cheers, Peter

Re: Balancing datasets

Peter Reutemann-3
The bias factor determines whether the output distribution is the same as the input or uniform. With the percentage you determine the number of output instances.

Cheers, Peter
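
The interplay of the two settings can be sketched with a little arithmetic. The snippet below is a simplified model, not Weka's source: it assumes the bias factor blends linearly between the input class proportions (bias 0) and a uniform split (bias 1), that every class is non-empty, and that rounding is plain `round`. The function name `resample_counts` is hypothetical.

```python
def resample_counts(class_counts, percent, bias):
    """Approximate per-class output sizes for a given percentage and bias.

    Illustrative model only (assumed linear blend); not Weka's actual code.
    bias = 0 keeps the input class distribution, bias = 1 makes it uniform.
    """
    total = sum(class_counts.values())
    sample_size = total * percent / 100.0
    num_classes = len(class_counts)
    return {
        label: round(
            sample_size * ((1 - bias) * count / total + bias / num_classes)
        )
        for label, count in class_counts.items()
    }


counts = {"True": 2000, "False": 8000}
same_as_input = resample_counts(counts, percent=3, bias=0)
uniform = resample_counts(counts, percent=3, bias=1)
print(same_as_input)
print(uniform)
```

Under these assumptions, a percentage of 3 on 10,000 instances gives 300 output instances, split 60/240 with bias 0 and 150/150 with bias 1.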

On August 1, 2020 8:41:54 PM GMT+12:00, javed khan <[hidden email]> wrote:

>Hi Peter and thanks for your reply.
>
>How can we select a fixed number of True and False instances from the
>Resample filter? I do not see any option there, am I wrong?
>
>Best regards

Re: Balancing datasets

asadbtk
Hi Peter, thanks for the feedback.

However, if I set the bias factor to 1, the True and False classes will be the same size, i.e. 150 True and 150 False. I want a subset with 60 True values and 240 False values. Can I do that?

Best regards



On Sat, Aug 1, 2020 at 2:06 AM Peter Reutemann <[hidden email]> wrote:
The bias factor determines whether the output distribution is the same as the input or uniform. With the percentage you determine the number of output instances.

Cheers, Peter

Re: Balancing datasets

asadbtk
I just wonder whether it is possible to use SMOTE to increase or decrease the ratio of majority to minority class instances so that it becomes 240:60.

Best regards

On Sat, Aug 1, 2020 at 11:41 AM javed khan <[hidden email]> wrote:
Hi Peter, thanks for the feedback.

However, if I set the bias factor to 1, the True and False classes will be the same size, i.e. 150 True and 150 False. I want a subset with 60 True values and 240 False values. Can I do that?

Best regards



Re: Balancing datasets

Peter Reutemann
In reply to this post by asadbtk
> However, if I set the bias factor to 1, the True and False classes will be the same size, i.e. 150 True and 150 False. I want a subset with 60 True values and 240 False values. Can I do that?

I'm confused... Your input data has a ratio of 1:4 for the two
classes, the same that you want in the output data. Why did you change
the bias factor from the default 0, which will maintain the same
distribution?

Cheers, Peter
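
The observation about the 1:4 ratio is easy to check with exact arithmetic. The following is a small sketch using only the numbers from this thread; the variable and function names are made up for illustration. It confirms that the input and the desired subset have the same class ratio, and derives the percentage that yields 300 instances out of 10,000.

```python
from fractions import Fraction

input_counts = {"True": 2000, "False": 8000}
target_counts = {"True": 60, "False": 240}


def class_ratio(counts):
    """Exact True:False ratio as a fraction."""
    return Fraction(counts["True"], counts["False"])


# Both ratios reduce to 1/4, so the class distribution is unchanged.
same_distribution = class_ratio(input_counts) == class_ratio(target_counts)

# Percentage of the input needed to obtain the target number of instances.
percent = Fraction(100 * sum(target_counts.values()), sum(input_counts.values()))
print(same_distribution, float(percent))
```

Since the distribution is unchanged, only the sample size percentage has to be set; the bias factor can stay at its default of 0.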

Re: Balancing datasets

asadbtk
Hello Peter

I am sorry, Peter, that I was unable to explain my question properly. I confess that it was a confused question; however, it is now resolved.

Thanks for your time.

Best regards


On Sat, Aug 1, 2020 at 10:18 PM Peter Reutemann <[hidden email]> wrote:
> However, if I set the bias factor to 1, the True and False classes will be the same size, i.e. 150 True and 150 False. I want a subset with 60 True values and 240 False values. Can I do that?

I'm confused... Your input data has a ratio of 1:4 for the two
classes, the same that you want in the output data. Why did you change
the bias factor from the default 0, which will maintain the same
distribution?

Cheers, Peter