Hyperparameter tuning with clean data

Hyperparameter tuning with clean data

asadbtk
Hi

I intend to tune hyperparameters using grid search and random search. As you know, this usually takes a lot of time. My question is: is it a wise decision to clean the datasets first, before running the hyperparameter search? That is, if I perform preprocessing steps such as feature selection and removal of outliers beforehand, is that a good idea, or will it introduce bias into the results?

My second question on this issue: does the choice of cross-validation (e.g., 10-fold vs. 5-fold) affect the performance of the tuned machine learning models?

Best regards 

Re: Hyperparameter tuning with clean data

Eibe Frank
If you have a separate test set that you will use for your final evaluation, you can do whatever you like with the training data. The key is that you must not use information from the test set to inform preprocessing, hyperparameter tuning, or model building.

So, for example, using WrapperSubsetEval to select attributes via cross-validation, discarding the attributes that were not selected, and then performing a cross-validation or train/test split on the modified data (whether in WEKA or R) is inappropriate and will introduce bias.
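
To make the pitfall concrete, here is roughly what that inappropriate workflow looks like in the WEKA Java API. This is an untested sketch; the file name "mydata.arff" and the choice of J48 as the base classifier are just placeholders:

// Untested sketch of the *inappropriate* workflow described above: the
// wrapper-based attribute selection sees the full dataset, so every fold of
// the subsequent cross-validation has already been used to pick attributes.
import java.util.Random;
import weka.attributeSelection.GreedyStepwise;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class LeakyPipeline {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("mydata.arff");   // placeholder file name
    data.setClassIndex(data.numAttributes() - 1);

    // Wrapper-based attribute selection applied to the FULL dataset.
    WrapperSubsetEval wrapperEval = new WrapperSubsetEval();
    wrapperEval.setClassifier(new J48());

    AttributeSelection filter = new AttributeSelection();
    filter.setEvaluator(wrapperEval);
    filter.setSearch(new GreedyStepwise());
    filter.setInputFormat(data);
    Instances reduced = Filter.useFilter(data, filter);

    // Cross-validating on 'reduced' now gives optimistically biased estimates,
    // because the selected attributes were chosen using all of the data.
    Evaluation cv = new Evaluation(reduced);
    cv.crossValidateModel(new J48(), reduced, 10, new Random(1));
    System.out.println(cv.toSummaryString());
  }
}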

Regarding your second question: in a 5-fold CV, only 80% of the data is available for training each of the 5 per-fold models; in a 10-fold CV, 90% of the data is available for this. There is a chance that different models will be chosen because of this (and also because CV is a non-deterministic process whose exact outcome depends on exactly how the data is randomly split).
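
If you want to see the size of the effect on your own data, an untested sketch along the following lines (again with a placeholder ARFF file and J48 as an arbitrary example classifier) evaluates the same learner with 5 and 10 folds and a few different random seeds; the estimates will usually differ a little between the settings:

// Untested sketch: the same classifier evaluated with 5-fold and 10-fold CV
// and with different random seeds will generally give slightly different
// estimates, because each setting trains on different subsets of the data.
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FoldComparison {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("mydata.arff");   // placeholder file name
    data.setClassIndex(data.numAttributes() - 1);

    for (int folds : new int[] {5, 10}) {
      for (int seed = 1; seed <= 3; seed++) {
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, folds, new Random(seed));
        System.out.printf("%2d folds, seed %d: accuracy = %.2f%%%n",
            folds, seed, eval.pctCorrect());
      }
    }
  }
}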

Cheers,
Eibe

Re: Hyperparameter tuning with clean data

asadbtk
Hello Eibe, thanks a lot. 

In R, yes, I use createDataPartition to split the datasets into training and test sets.

For WEKA, if I use the meta classifiers and set up feature selection as well as hyperparameter tuning through them, do you think that would be fine?

Best regards 


Re: Hyperparameter tuning with clean data

Eibe Frank
Yes. If you integrate all preprocessing, parameter tuning, and attribute selection directly into the process of learning the classification model, by configuring appropriate meta classifiers to do so (e.g., MultiSearch, AttributeSelectedClassifier, and FilteredClassifier), you will be on the safe side: this ensures that the test data is not used to inform the model, regardless of the evaluation method you use (e.g., k-fold cross-validation).
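
For illustration, here is an untested sketch of that kind of set-up in the WEKA Java API. It uses CVParameterSelection from the core distribution for the tuning step (MultiSearch is a separate package), and J48 plus a placeholder ARFF file purely as examples:

// Untested sketch: attribute selection and parameter tuning are wrapped
// inside the classifier itself, so the outer cross-validation re-runs them
// from scratch on each training fold and the test folds never inform them.
// Note that nesting a wrapper search inside an outer CV can be slow.
import java.util.Random;
import weka.attributeSelection.GreedyStepwise;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.meta.CVParameterSelection;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SafePipeline {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("mydata.arff");   // placeholder file name
    data.setClassIndex(data.numAttributes() - 1);

    // Inner tuning: vary J48's -C parameter from 0.1 to 0.5 in 5 steps,
    // chosen by internal cross-validation on the training data only.
    CVParameterSelection tuned = new CVParameterSelection();
    tuned.setClassifier(new J48());
    tuned.addCVParameter("C 0.1 0.5 5");

    // Wrapper-based attribute selection around the tuned classifier.
    WrapperSubsetEval wrapperEval = new WrapperSubsetEval();
    wrapperEval.setClassifier(new J48());

    AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
    asc.setEvaluator(wrapperEval);
    asc.setSearch(new GreedyStepwise());
    asc.setClassifier(tuned);

    // The outer cross-validation evaluates the whole pipeline.
    Evaluation outer = new Evaluation(data);
    outer.crossValidateModel(asc, data, 10, new Random(1));
    System.out.println(outer.toSummaryString());
  }
}

A FilteredClassifier can be nested in the same way if you also need preprocessing filters (e.g., for outlier removal) applied per training fold rather than to the full dataset.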

Cheers,
Eibe
