Hi
I intend to tune hyperparameters using grid search and random search. As you know, this usually takes a lot of time. My question is: is it wise to clean the datasets before running the hyperparameter search? That is, if I first perform preprocessing steps such as feature selection and outlier removal, is that a good idea, or will it introduce bias into the results?
My second question on this topic: does the choice of cross-validation scheme (10-fold, 5-fold, etc.) affect the performance of the tuned machine learning models?
Best regards
If you have a separate test set that you will use for your final evaluation, you can do whatever you like with the training data. The key is that you must not use information from the test set to inform preprocessing, hyperparameter tuning, or model building.
So, for example, using WrapperSubsetEval to select attributes via cross-validation on the full dataset, discarding the unselected attributes, and then performing a cross-validation or train/test split on the modified data (whether in WEKA or R) is inappropriate and will introduce bias.
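By contrast, a minimal sketch of a clean protocol using the WEKA Java API might look as follows (the file name mydata.arff and the choice of J48 as the learner are just placeholders):

  import java.util.Random;
  import weka.classifiers.Evaluation;
  import weka.classifiers.trees.J48;
  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;

  public class HoldoutProtocol {
    public static void main(String[] args) throws Exception {
      // Load the full dataset; the last attribute is assumed to be the class.
      Instances data = DataSource.read("mydata.arff");
      data.setClassIndex(data.numAttributes() - 1);

      // Shuffle once, then hold out 20% as the final test set.
      data.randomize(new Random(1));
      int trainSize = (int) Math.round(data.numInstances() * 0.8);
      Instances train = new Instances(data, 0, trainSize);
      Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

      // Any cleaning, feature selection, or tuning may only look at 'train'.
      J48 model = new J48();
      model.buildClassifier(train);

      // The test set is touched exactly once, for the final estimate.
      Evaluation eval = new Evaluation(train);
      eval.evaluateModel(model, test);
      System.out.println(eval.toSummaryString());
    }
  }

Everything between the split and the final evaluateModel call is free to use the training set however it likes.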
Regarding your second question: in a 5-fold CV, only 80% of the data is available for training each of the 5 per-fold models; in a 10-fold CV, 90% of the data is available for this. There is a chance that different models will be chosen because of this (and also because CV is a non-deterministic process whose exact outcome depends on exactly how the data is randomly split).
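To see this effect directly, you could run a sketch like the following, which cross-validates the same learner with different fold counts and random seeds (again, mydata.arff and J48 are placeholders); the resulting estimates will generally differ slightly:

  import java.util.Random;
  import weka.classifiers.Evaluation;
  import weka.classifiers.trees.J48;
  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;

  public class CVVariability {
    public static void main(String[] args) throws Exception {
      Instances data = DataSource.read("mydata.arff");
      data.setClassIndex(data.numAttributes() - 1);

      // Same learner, different fold counts and random seeds:
      // each run splits the data differently, so the estimates vary.
      for (int folds : new int[] {5, 10}) {
        for (int seed = 1; seed <= 3; seed++) {
          Evaluation eval = new Evaluation(data);
          eval.crossValidateModel(new J48(), data, folds, new Random(seed));
          System.out.printf("%d-fold CV, seed %d: %.2f%% correct%n",
              folds, seed, eval.pctCorrect());
        }
      }
    }
  }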
Cheers, Eibe
Hello Eibe, thanks a lot.
In R, yes, I use createDataPartition to split the datasets into training and test sets.
For WEKA, if I use a meta classifier from the Classify tab and configure feature selection as well as hyperparameter tuning there, do you think that would be fine?
Best regards
Yes, if you integrate all preprocessing, parameter tuning, and attribute selection directly into the process of learning the classification model by configuring appropriate meta classifiers (e.g., MultiSearch, AttributeSelectedClassifier, and FilteredClassifier), you will be on the safe side: this ensures that the test data is not used to inform the model, regardless of the evaluation method you use (e.g., k-fold cross-validation).
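As a rough sketch of such a nested setup with the WEKA Java API (the file name and the J48 base learner are placeholders, and CVParameterSelection stands in here for MultiSearch, which is distributed as a separate WEKA package):

  import java.util.Random;
  import weka.attributeSelection.BestFirst;
  import weka.attributeSelection.WrapperSubsetEval;
  import weka.classifiers.Evaluation;
  import weka.classifiers.meta.AttributeSelectedClassifier;
  import weka.classifiers.meta.CVParameterSelection;
  import weka.classifiers.trees.J48;
  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;

  public class NestedSelectionAndTuning {
    public static void main(String[] args) throws Exception {
      Instances data = DataSource.read("mydata.arff");
      data.setClassIndex(data.numAttributes() - 1);

      // Inner layer: tune J48's pruning confidence by internal cross-validation.
      CVParameterSelection tuned = new CVParameterSelection();
      tuned.setClassifier(new J48());
      tuned.addCVParameter("C 0.1 0.5 5"); // try -C from 0.1 to 0.5 in 5 steps

      // Outer layer: wrapper-based attribute selection around the tuned learner.
      WrapperSubsetEval wrapper = new WrapperSubsetEval();
      wrapper.setClassifier(new J48());
      AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
      asc.setEvaluator(wrapper);
      asc.setSearch(new BestFirst());
      asc.setClassifier(tuned);

      // Selection and tuning now happen inside buildClassifier(), so each
      // cross-validation training fold redoes them without seeing its test fold.
      Evaluation eval = new Evaluation(data);
      eval.crossValidateModel(asc, data, 10, new Random(1));
      System.out.println(eval.toSummaryString());
    }
  }

Because the whole pipeline is a single classifier, the outer crossValidateModel call estimates the performance of the complete procedure, including the selection and tuning steps.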
Cheers, Eibe