When I used chi-squared for feature ranking, I noticed a rather long
"calculation time" for a dataset with 11,000 instances and 188,913 features:
"weka.attributeSelection.ChiSquaredAttributeEval -B" -i input.arff -o
output-chi.arff -c last
By stepping through the code I found that it was not the chi-squared
computation that took ages (it is not that complex), but rather a method
call used for generating the relation name stated in the first line of the
ARFF file. All the indices of the features got enumerated and stitched
together into one long string (188,913 indices!).
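To illustrate how stitching that many indices into one string can get expensive, here is a minimal sketch. It is not Weka's code: the class RelationNameSketch and both methods are hypothetical, and the assumption that repeated string concatenation is involved is mine, not something I have confirmed in the Weka source. It only compares appending with += (which copies the whole string on every step) against a StringBuilder.

```java
public class RelationNameSketch {

    // Hypothetical naive version: every += allocates a new String and
    // copies everything built so far, so total work grows quadratically.
    static String joinNaive(int n) {
        String s = "";
        for (int i = 1; i <= n; i++) {
            if (i > 1) {
                s += ",";
            }
            s += i;
        }
        return s;
    }

    // Same output, but StringBuilder appends in amortised constant time,
    // so total work grows linearly with the number of indices.
    static String joinWithBuilder(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= n; i++) {
            if (i > 1) {
                sb.append(',');
            }
            sb.append(i);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Default is smaller than 188,913 to keep the naive run short;
        // pass the real feature count as an argument to try it.
        int n = args.length > 0 ? Integer.parseInt(args[0]) : 50_000;

        long t0 = System.nanoTime();
        String fast = joinWithBuilder(n);
        long t1 = System.nanoTime();
        String slow = joinNaive(n);
        long t2 = System.nanoTime();

        System.out.println("indices:       " + n);
        System.out.println("string length: " + fast.length());
        System.out.println("same result:   " + fast.equals(slow));
        System.out.println("StringBuilder: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("naive +=:      " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

Running it with "java RelationNameSketch.java 188913" shows how the gap between the two variants widens at the size of the dataset above; the absolute numbers depend on the JVM and the machine.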
It is quite handy to know what happened to the data by looking at the
relation name, but generating an ARFF file with features ranked by
chi-squared in a couple of minutes rather than several hours (or even a
day) makes quite a difference when performing lots of experiments.
Since this behaviour is independent of the ChiSquaredAttributeEval class,
the phenomenon also happens when using Information Gain, etc., and might
also delay other classes.
Here is the call hierarchy from bottom to top (the nasty stuff happens in