I would like to ask for advice on the peculiarities of working with small data sets that have a large number of features (for example, 15 instances and 30 features). Unfortunately, with such data the model result (regression, correlation coefficient) depends strongly on how cross-validation assigns instances to the training and test folds, to the point that the correlation can change sign.

Thanks in advance,
Anatoliy

-- Sent from: https://weka.8497.n7.nabble.com/
_______________________________________________
Wekalist mailing list -- [hidden email]
Send posts to [hidden email]
To unsubscribe send an email to [hidden email]
To subscribe, unsubscribe, etc., visit https://list.waikato.ac.nz/postorius/lists/wekalist.list.waikato.ac.nz
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html
Hi Anatoliy,

Some people use a support vector machine (SVM) when there are too few cases and too many variables. WEKA has an SVM implementation. Here is a tutorial.

Bye for now,
George

On Mon, Nov 30, 2020 at 3:46 PM Anatoliy <[hidden email]> wrote:
I would like to hear from you some advice on the peculiarities of working
Hi George!
Great, thanks. Yes, I have read about using SVMs. Thanks for the link, by the way; it is a good resource. What interests me more is how to divide such a small data set into training and test sets. Is there a principle for choosing effective cross-validation folds?
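One way to look at the fold question, offered here only as a hedged sketch and not as something proposed in the thread: with 15 instances and 10 folds, each test fold holds one or two instances, so a correlation computed per fold is either undefined or exactly +1 or -1. A common workaround is to pool the out-of-fold predictions and compute the correlation once, and to repeat the procedure with different shuffles to see how much the result moves. The data below are synthetic, a single feature stands in for the 30, and the names `pooled_cv_correlation` and `repeated_cv` are made up for the illustration.

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x on a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = 0.0 if sxx == 0 else sxy / sxx
    return my - b * mx, b

def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def pooled_cv_correlation(xs, ys, k, rng):
    """Predict each instance once, while it is held out, then correlate once."""
    preds = [0.0] * len(xs)
    for fold in kfold_indices(len(xs), k, rng):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in fold:
            preds[i] = a + b * xs[i]
    return pearson(preds, ys)

def repeated_cv(xs, ys, k=5, repeats=20, seed=0):
    """Repeat k-fold CV with different shuffles; one pooled r per repeat."""
    rng = random.Random(seed)
    return [pooled_cv_correlation(xs, ys, k, rng) for _ in range(repeats)]

# Synthetic stand-in for a 15-instance data set (assumption, not real data).
data_rng = random.Random(42)
xs = [data_rng.gauss(0, 1) for _ in range(15)]
ys = [0.5 * x + data_rng.gauss(0, 1) for x in xs]

scores = repeated_cv(xs, ys)
print(f"pooled r over 20 repeats: min={min(scores):.2f} "
      f"mean={mean(scores):.2f} max={max(scores):.2f}")

# Leave-one-out is k-fold with k = n; its result does not depend on the shuffle.
loo = pooled_cv_correlation(xs, ys, k=len(xs), rng=random.Random(0))
print(f"leave-one-out pooled r: {loo:.2f}")
```

The spread between the minimum and maximum over repeats gives a rough sense of how much of the reported correlation is due to the split rather than the model. Leave-one-out removes the dependence on the shuffle, although its estimate can still be noisy on so few instances.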
Hi Anatoliy,

Thanks for your email. I usually use 1/10 to 1/4 of the data for the training set. But with the problem of too few cases, I would be tempted to treat the whole data set as the training set. You can experiment with different training/test split ratios and see whether it makes a significant difference.

I'm not the best person to ask about how to determine the most effective cross-validation folds. I am under the assumption that folds require large data sets, and you don't have one, so I hesitate to offer an opinion there.

I do know that if you are looking at a known minority case, you may have to over-represent it in the training set, or it will be ignored if it is too infrequent. This occurs in medical data, where death is an infrequent but important outcome. If that outcome makes up less than 1/10 of the cases, there is a risk that it will be neglected in the training set. I sometimes put a higher proportion of the rare cases into the training set than occurs in the total data set, so that the learning algorithm will not dismiss them. I also make sure there are still some unseen minority cases in the test set.

Good luck.

Bye for now,
George

On Tue, Dec 1, 2020 at 3:31 PM Anatoliy <[hidden email]> wrote:
Hi George!
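The minority-case advice above could be sketched roughly as follows. This is only one possible reading of it: the split is stratified so that both sets contain rare cases, and the rare cases in the training set are then over-represented by simple duplication. The function name `split_with_minority_boost`, the `boost` factor, the 25% test fraction and the toy labels are all assumptions made for the illustration.

```python
import random

def split_with_minority_boost(labels, minority, test_fraction=0.25,
                              boost=2, seed=0):
    """Return (train_idx, test_idx) as lists of positions in `labels`.

    A stratified hold-out split is made first, so both sets contain
    minority cases. Each minority index in the training set is then
    repeated `boost` times (oversampling by duplication). Test-set
    cases are never duplicated and never copied into the training set.
    """
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Hold out at least one case per class when the class has more than one.
        n_test = max(1, round(len(idx) * test_fraction)) if len(idx) > 1 else 0
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    boosted = []
    for i in train:
        boosted.extend([i] * (boost if labels[i] == minority else 1))
    rng.shuffle(boosted)
    return boosted, test

# Toy labels (assumption): the rare outcome is 4 of 40 cases, i.e. 1/10.
labels = ["survived"] * 36 + ["died"] * 4
train_idx, test_idx = split_with_minority_boost(labels, minority="died")

train_share = sum(labels[i] == "died" for i in train_idx) / len(train_idx)
test_count = sum(labels[i] == "died" for i in test_idx)
print(f"minority share in training set: {train_share:.2f}")
print(f"unseen minority cases in test set: {test_count}")
```

Duplicating only after the split matters: if rare cases were copied first, the same case could land in both sets and the test result would look better than it should.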
Hi George!
Thanks for the feedback! I realize I need to increase the amount of data, since in my case the model's correlation depends heavily on how the cross-validation folds are chosen. As far as I understand, using the training set itself as the test set is rather dangerous, but if I don't intend to generalize the model, can I try it?

Regards,
Anatoliy
Hi George!
Yes, if I am not generalizing the model, then I can probably do so. Why am I not generalizing? This is a heuristic conclusion: there are a couple of attributes that I cannot take into account in the model but that significantly affect generalization. Therefore, I decided to apply the model locally, where the effect of these attributes is, on average, a constant.

Kind regards,
Anatoliy
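The danger of testing on the training set can be illustrated with a small, hedged experiment that is not part of the thread. The target below is pure noise, unrelated to the 30 features, and a 1-nearest-neighbour regressor stands in for whatever model is actually used. Evaluated on its own training data, the model reproduces every target exactly, while leave-one-out predictions have to come from a different instance. The data, the seed and the choice of model are assumptions made for the illustration.

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def nearest_neighbour_predict(train_X, train_y, x):
    """1-nearest-neighbour regression using squared Euclidean distance."""
    best = min(range(len(train_X)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(train_X[j], x)))
    return train_y[best]

rng = random.Random(0)
n, p = 15, 30  # same shape as the data set described in the thread
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]  # target is noise, unrelated to X

# Test on the training set: each instance is its own nearest neighbour.
resub = [nearest_neighbour_predict(X, y, X[i]) for i in range(n)]

# Leave-one-out: the instance being predicted is removed before fitting.
loo = [nearest_neighbour_predict(X[:i] + X[i + 1:], y[:i] + y[i + 1:], X[i])
       for i in range(n)]

r_resub = pearson(resub, y)
r_loo = pearson(loo, y)
print(f"correlation when testing on the training set: {r_resub:.2f}")
print(f"correlation under leave-one-out:              {r_loo:.2f}")
```

The training-set correlation is perfect even though there is nothing to learn, so a high value obtained this way says little about the model, whether or not the model is meant to generalize.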