How and Why to do internal CV

How and Why to do internal CV

asadbtk
Hi, I have read these lines from an article, I dont to know why it two times 5 fold CV? How they perform CV two times, because usually we separate the train and test data, provide the test data as "Supplied test set" where the Cross-Validation button is disabled. My question is: (a) How to do what they have done?  (b) Why it is biased if we dont follow it?
The text from the paper is below:

We split the datasets into training and testing sets using fivefold cross validation, to assess their predictive accuracy. Within each fold, we used another internal five-fold cross validation for model selection, to avoid the potential bias of training and testing models on the same dataset. 

Thanks again

Re: How and Why to do internal CV

Eibe Frank-2
Administrator
There are several ways to perform parameter tuning using internal cross-validation in WEKA:

* CVParameterSelection: quite limited in what types of parameters can be optimised.

* GridSearch: can optimise two numeric parameters at a time using grid search; the grid can be automatically extended if necessary.

* MultiSearch: random or grid search on any number of parameters; very flexible.

MultiSearch is the most modern way to perform parameter tuning in WEKA.
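
For instance, here is a minimal sketch of parameter tuning via internal cross-validation with the WEKA Java API, using CVParameterSelection; the base classifier (J48), the tuned parameter, its range, the fold count, and the file name are illustrative assumptions, not recommendations:

import weka.classifiers.meta.CVParameterSelection;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TuneWithInternalCV {
    public static void main(String[] args) throws Exception {
        // Load the training data (file name is illustrative).
        Instances train = DataSource.read("train.arff");
        train.setClassIndex(train.numAttributes() - 1);

        // Try J48's confidence factor -C at 5 values between 0.1 and 0.5,
        // judged by 5-fold cross-validation on the training data only.
        CVParameterSelection tuner = new CVParameterSelection();
        tuner.setClassifier(new J48());
        tuner.addCVParameter("C 0.1 0.5 5");
        tuner.setNumFolds(5);

        tuner.buildClassifier(train);  // internal CV happens here
        System.out.println(tuner);     // reports the selected option values
    }
}

GridSearch and MultiSearch are wired up in the same spirit (wrap a base classifier and declare the parameters to explore), just with their own, more flexible parameter-specification mechanisms.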

There are also some feature selection methods in WEKA that perform internal cross-validation to decide on the set of features to use, e.g., WrapperSubsetEval.
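
As a rough illustration of the same idea for feature selection, the sketch below uses internal cross-validation of a classifier on the training data to score feature subsets; the classifier, search method, fold count, and file name are again illustrative assumptions:

import weka.attributeSelection.BestFirst;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WrapperSelectionSketch {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff");  // illustrative file name
        train.setClassIndex(train.numAttributes() - 1);

        // Score candidate feature subsets by 5-fold CV of J48 on the training data.
        WrapperSubsetEval evaluator = new WrapperSubsetEval();
        evaluator.setClassifier(new J48());
        evaluator.setFolds(5);

        // AttributeSelectedClassifier selects attributes and trains the final
        // model on whatever data it is built on, so no test data is involved.
        AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
        asc.setEvaluator(evaluator);
        asc.setSearch(new BestFirst());
        asc.setClassifier(new J48());
        asc.buildClassifier(train);
    }
}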

The key observation is that you *must not* use the test data that is used to obtain your final estimate of predictive performance to influence the process of finding a model. If you start to peek at (performance on) the test data to influence the way the model is chosen, you risk introducing bias. The extreme case is fitting the model to the test data itself: in that case, the performance estimate obtained on this data is useless.

This also applies if you use k-fold cross-validation or a similar method to obtain multiple subsets for testing (and training): the test sets must not be used for adjusting the model. Accordingly, if you apply something like CVParameterSelection in an (external) 10-fold cross-validation in the Classify panel of the Explorer, internal 10-fold cross-validation is applied to each of the 10 training sets of the external cross-validation. The test sets of the external cross-validation are not considered when choosing parameter values.
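
A minimal sketch of such a nested setup with the WEKA Java API is shown below: the external evaluation cross-validates the tuning classifier, and the internal cross-validation runs on each external training fold automatically. Dataset name, fold counts, and parameter range are illustrative assumptions.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.meta.CVParameterSelection;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NestedCV {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data.arff");  // illustrative file name
        data.setClassIndex(data.numAttributes() - 1);

        // The tuner runs the *internal* cross-validation on whatever training
        // data it receives, so the external test folds never influence the
        // choice of parameter values.
        CVParameterSelection tuner = new CVParameterSelection();
        tuner.setClassifier(new J48());
        tuner.addCVParameter("C 0.1 0.5 5");
        tuner.setNumFolds(5);

        // External 5-fold cross-validation for the final performance estimate.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tuner, data, 5, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}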

Cheers,
Eibe

Re: How and Why to do internal CV

asadbtk
Thanks Eibe for your detailed reply. 

Let me explain the steps I followed, and ask whether they are fine (unbiased).

I separated the train and test data and applied SMOTE to the training data for class balancing. Then I used AttributeSelectedClassifier, setting a particular feature selection algorithm and classifier. The test data I supplied separately as the test set. Eibe, could you kindly comment?

Thanks a lot 

Re: How and Why to do internal CV

Eibe Frank
That sounds fine. The key points are that you are not applying SMOTE to the test data and that you are selecting attributes based on the training data only (which is what you achieve by using AttributeSelectedClassifier).
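
For reference, a minimal sketch of such a setup with the WEKA Java API; it assumes the SMOTE filter from WEKA's package manager is installed, and the file names, evaluator, search method, and classifier are illustrative choices rather than the exact ones used above:

import weka.classifiers.Evaluation;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.supervised.instance.SMOTE;

public class SmoteOnTrainingOnly {
    public static void main(String[] args) throws Exception {
        // Illustrative file names; the class attribute is assumed to be last.
        Instances train = DataSource.read("train.arff");
        Instances test = DataSource.read("test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Attribute selection plus final classifier, fitted on the training
        // data only (evaluator and search can be configured as in the sketch
        // given earlier in this thread).
        AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
        asc.setClassifier(new J48());

        // FilteredClassifier applies SMOTE when the model is built, i.e. to
        // the training data only; the supplied test set is left untouched.
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new SMOTE());
        fc.setClassifier(asc);
        fc.buildClassifier(train);

        // Evaluate the trained pipeline on the separately supplied test set.
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(fc, test);
        System.out.println(eval.toSummaryString());
    }
}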

Cheers,
Eibe

Re: How and Why to do internal CV

asadbtk
Thanks for your guidance, Eibe.

Regards 
