Re: Regarding the statistical test for the predictability of model
No, the paired t-test is a parametric test and the Wilcoxon signed-rank test is a non-parametric test. Both are often applied to pairs of observed samples though (e.g., differences in classification accuracy for two methods A and B on the same train/test sets).
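To make the pairing concrete, here is a minimal sketch with SciPy showing both tests applied to made-up per-fold accuracies for two hypothetical methods A and B evaluated on the same folds (the numbers are invented for illustration):

```python
from scipy.stats import ttest_rel, wilcoxon

# Hypothetical per-fold accuracies for methods A and B on the SAME folds
acc_a = [0.81, 0.79, 0.84, 0.80, 0.78, 0.83, 0.82, 0.80, 0.79, 0.81]
acc_b = [0.78, 0.77, 0.80, 0.79, 0.76, 0.81, 0.79, 0.78, 0.77, 0.79]

# Parametric test: assumes the per-fold differences are roughly normal
t_stat, t_p = ttest_rel(acc_a, acc_b)

# Non-parametric alternative: ranks the differences instead
w_stat, w_p = wilcoxon(acc_a, acc_b)

print(f"paired t-test: t={t_stat:.3f}, p={t_p:.4f}")
print(f"Wilcoxon:      W={w_stat:.1f}, p={w_p:.4f}")
```

Both tests see only the vector of paired differences, which is why the warning below about dependent resamples applies equally to either of them.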
If the differences between pairs of observed samples are not normally distributed, the Wilcoxon signed-rank test is often recommended. However, in the context of evaluating results from k-fold cross-validation or the repeated hold-out method, both tests are problematic, because the observed differences are based on overlapping training sets and, for the repeated hold-out method and repeated k-fold cross-validation, overlapping test sets as well. Applying either test to establish whether a particular learning algorithm A outperforms another learning algorithm B on data taken from a certain domain will therefore yield more “significant” differences than there really are (i.e., the Type I error is greater than the significance level specified by the user).

WEKA’s Experimenter instead uses the “corrected” resampled t-test, which applies a correction to the t-test’s test statistic to compensate for the dependency due to overlapping subsamples of the data. Empirical results indicate that it works reasonably well: the Type I error is close to the significance level specified by the user.
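The correction (due to Nadeau and Bengio) replaces the usual 1/n variance term with 1/n + n2/n1, where n is the number of resamples, n1 the training-set size and n2 the test-set size per run. A minimal sketch of the statistic (my own illustration, not WEKA’s actual code):

```python
import math
from scipy.stats import t as t_dist

def corrected_resampled_ttest(diffs, n_train, n_test):
    """Corrected resampled t-test on per-run accuracy differences.

    diffs   : per-run differences (method A minus method B)
    n_train : number of training instances per run
    n_test  : number of test instances per run
    """
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    # The standard resampled t-test would use var / n here; the correction
    # inflates the variance estimate to account for overlapping subsamples.
    denom = math.sqrt((1.0 / n + n_test / n_train) * var)
    t_stat = mean / denom
    p = 2 * t_dist.sf(abs(t_stat), df=n - 1)  # two-sided p-value
    return t_stat, p
```

For 10 times 10-fold cross-validation on a dataset of, say, 1000 instances, you would call it with n_train=900 and n_test=100 on the 100 per-fold differences; the inflated denominator makes the corrected statistic smaller in magnitude than the standard one.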
An extreme example showing the pathological behaviour is where you just generate synthetic, completely random data from a uniform distribution, with no dependency between the class and the predictor attributes. No learning algorithm can possibly outperform another one on data generated in this way: there is nothing to learn. However, if you repeatedly run 10 times 10-fold cross-validation for two methods A and B on data generated this way, with a standard paired t-test at a significance level of 5%, you will detect a “difference” in more than 5% of the cases (i.e., the Type I error is inflated). In fact, you can ramp up the number of runs, e.g., to 100 times 10-fold cross-validation, to increase the chance of observing a “significant” difference!
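The inflation can be mimicked without running any learner: under the null hypothesis, per-run accuracy differences from overlapping resamples behave like equicorrelated noise. A sketch of that simulation (the correlation of 0.1 is an assumed value matching a 10% test-set fraction, not output from a real cross-validation experiment):

```python
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(0)
n, rho, trials, alpha = 100, 0.1, 2000, 0.05   # 10x10-fold CV -> 100 differences
crit = t_dist.ppf(1 - alpha / 2, df=n - 1)     # two-sided critical value

std_rej = corr_rej = 0
for _ in range(trials):
    # Equicorrelated null differences: the shared component z0 models
    # the dependency induced by overlapping training/test sets.
    z0 = rng.standard_normal()
    d = np.sqrt(rho) * z0 + np.sqrt(1 - rho) * rng.standard_normal(n)
    mean, var = d.mean(), d.var(ddof=1)
    t_std = mean / np.sqrt(var / n)                  # standard paired t-test
    t_corr = mean / np.sqrt((1 / n + 1 / 9) * var)   # corrected (n2/n1 = 1/9)
    std_rej += abs(t_std) > crit
    corr_rej += abs(t_corr) > crit

print(f"standard  Type I rate: {std_rej / trials:.3f}")
print(f"corrected Type I rate: {corr_rej / trials:.3f}")
```

With these settings the standard test rejects far more often than the nominal 5%, while the corrected test stays near it, which is exactly the behaviour described above.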
I want to understand the concept behind applying the "Paired T-Tester (corrected)" for a statistical comparison of Model A. I've built Model A in WEKA.
For Model A
1) I took the "mushroom dataset" for Model A, applied a feature selection technique (PCA), and then ran the "naive bayes classifier", which shows an accuracy of 75%. The reason I chose naive Bayes is that, when compared with other classifiers (SVM, Random Forest, and SMO), it had the best accuracy of the three.
2) I then applied the "Paired T-Tester (corrected)" to find out whether Model A shows any statistical difference. The same "mushroom dataset" was loaded with the same "naive bayes classifier", and Model A was tested with the "Paired T-Tester". The accuracy came out to 78%.
The classifier accuracy is 75%, and the classifier accuracy with the Paired T-Tester is 78% (a difference of 3%).
Is this approach correct for testing Model A? I'm asking because I used the same "mushroom dataset" and the same "naive bayes classifier" to test Model A with the Paired T-Tester.