FS with different folds of CV


neha.bologna
Hi Peter, I hope you are fine.

The main aim of my project is to check the consistency of various FS algorithms. One option is to check which metrics the FS algorithms select when we use different random seeds, keeping the same CV folds. Another option is to assess the consistency of FS when we change the number of folds, i.e. 10-fold, 5-fold, 3-fold, etc.

Could you please advise what other options are available for slightly changing the training sample so that we can evaluate the consistency of FS algorithms?

Warm regards


_______________________________________________
Wekalist mailing list -- [hidden email]
Send posts to [hidden email]
To unsubscribe send an email to [hidden email]
To subscribe, unsubscribe, etc., visit https://list.waikato.ac.nz/postorius/lists/wekalist.list.waikato.ac.nz
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: FS with different folds of CV

Peter Reutemann
> The main aim of my project is to check the consistency of various FS algorithms. One option is to check which metrics the FS algorithms select when we use different random seeds, keeping the same CV folds. Another option is to assess the consistency of FS when we change the number of folds, i.e. 10-fold, 5-fold, 3-fold, etc.
>
> Could you please advise what other options are available for slightly changing the training sample so that we can evaluate the consistency of FS algorithms?

You could randomly remove instances from your dataset (Randomize
filter followed by a RemoveRange or RemovePercentage) or add noise
(AddNoise filter):
https://weka.sourceforge.io/doc.dev/weka/filters/unsupervised/instance/RemoveRange.html
https://weka.sourceforge.io/doc.dev/weka/filters/unsupervised/instance/RemovePercentage.html
https://weka.sourceforge.io/doc.dev/weka/filters/unsupervised/attribute/AddNoise.html
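In plain Python terms, the effect of a Randomize filter followed by RemovePercentage can be sketched roughly like this (toy data and the 20% figure are just for illustration, not Weka's actual implementation):

```python
import random

def remove_percentage(instances, percentage, seed):
    """Shuffle the dataset with a fixed seed, then drop the first
    `percentage` percent of instances -- roughly what Weka's Randomize
    followed by RemovePercentage does."""
    rng = random.Random(seed)
    shuffled = instances[:]           # work on a copy
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * percentage / 100)
    return shuffled[cut:]             # keep the remainder

# Hypothetical toy dataset: (feature vector, label) pairs.
data = [([i, i * 2], i % 2) for i in range(100)]

# Different seeds give different perturbed training samples of equal size.
sample_a = remove_percentage(data, 20, seed=1)
sample_b = remove_percentage(data, 20, seed=2)
print(len(sample_a), len(sample_b))   # both 80, but different instances
```

Running the FS algorithms on each perturbed sample and comparing the selected sets is then one way to probe their stability.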

I don't use any feature selection methods myself, so I'm not sure whether
introducing missing values (or imputing them) has any impact:
https://github.com/fracpete/missing-values-imputation-weka-package

Cheers, Peter
--
Peter Reutemann
Dept. of Computer Science
University of Waikato, NZ
+64 (7) 858-5174
http://www.cms.waikato.ac.nz/~fracpete/
http://www.data-mining.co.nz/

Re: FS with different folds of CV

neha.bologna
Thank you, Peter, for your feedback.

Actually, I was also thinking about removing some instances randomly. At first I thought I should remove the outliers, but that would not be reliable, since some datasets have very few outliers while others have many. Removing a percentage of instances would be a better idea.

Warm regards


On Tue, Jun 30, 2020 at 1:09 AM Peter Reutemann <[hidden email]> wrote:

Re: FS with different folds of CV

neha.bologna
Hi Peter, I just checked the consistency of the FS algorithms with different validation methods, i.e. 3-fold, 5-fold, 10-fold, LOOCV and percentage split. What I found is that all FS algorithms select the same metrics regardless of the validation method. I mean the set of metrics selected by GA with k-fold CV is also selected via LOOCV. Does this mean that the validation method has little influence on the consistency of metric selection? The same holds when using different random seeds.

When removing some instances, however, the metrics selected by the different FS algorithms do differ.
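As an aside, a standard way to put a number on this kind of consistency is the Jaccard index between the selected feature sets under two settings. A minimal sketch (the metric names below are hypothetical, just for illustration):

```python
def jaccard(selected_a, selected_b):
    """Jaccard similarity between two sets of selected features:
    1.0 means identical selections, 0.0 means disjoint ones."""
    a, b = set(selected_a), set(selected_b)
    if not a and not b:
        return 1.0          # two empty selections count as identical
    return len(a & b) / len(a | b)

# Hypothetical selections from different validation settings.
ga_10fold  = {"loc", "cbo", "wmc", "rfc"}
ga_loocv   = {"loc", "cbo", "wmc", "rfc"}  # identical -> 1.0
ga_removed = {"loc", "cbo", "dit"}         # after removing instances

print(jaccard(ga_10fold, ga_loocv))    # 1.0
print(jaccard(ga_10fold, ga_removed))  # 2 shared / 5 in union = 0.4
```

Averaging this over many pairs of runs gives a single stability score per FS algorithm, which makes the comparison across algorithms easier to report.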

Warm regards


On Tue, Jun 30, 2020 at 1:15 AM Neha gupta <[hidden email]> wrote:

Re: FS with different folds of CV

Peter Reutemann
> Hi Peter, I just checked the consistency of FS algorithms with different validation methods i..e 3 fold, 5 fold, 10 fold, LOOCV and percentage split. What I found is all FS select the same metrics regardless of the validation methods. I mean the set of metrics selected by GA with k fold CV are also selected via LOOCV. Does it mean the validation methods have low influence on the consistency of metrics selection?  Same is the case with using different/random seeds.
>
> With removing some instances, however, the selection of metrics by different FS algorithms are different.

k-fold CV (apart from LOOCV) attempts to recreate the same
distribution of labels in each train/test fold pair, which could explain the
minimal difference between different values of k.
Removing instances, on the other hand, may change the label distribution
and therefore have a different impact on the chosen attributes.
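The point can be illustrated with a small pure-Python sketch (toy data, not Weka's implementation): stratified folds keep the class ratio of the full dataset, while random removal of instances need not.

```python
import random
from collections import Counter

# Toy imbalanced dataset: 80 instances of class 0, 20 of class 1.
labels = [0] * 80 + [1] * 20

def stratified_folds(labels, k, seed):
    """Deal the indices of each class round-robin into k folds, so
    every fold keeps roughly the full dataset's class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

for fold in stratified_folds(labels, 5, seed=42):
    counts = Counter(labels[i] for i in fold)
    print(counts[0], counts[1])   # 16 and 4 in every fold -> same 4:1 ratio

# Random removal, by contrast, can shift the class ratio of what remains.
rng = random.Random(7)
kept = rng.sample(range(len(labels)), 60)
print(Counter(labels[i] for i in kept))
```

So an FS algorithm sees essentially the same label distribution in every CV setting, but a possibly different one after random instance removal.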

Disclaimer: I don't use attribute selection.

Cheers, Peter

Re: FS with different folds of CV

neha.bologna
OK, thanks a lot for your time, Peter.

Warm regards 

On Wednesday, July 1, 2020, Peter Reutemann <[hidden email]> wrote: