attribute evaluator performance

Michael Dittenbach
Hi all!

When I used chi-squared for feature ranking, I noticed a rather long
"calculation time" for a dataset with 11,000 instances and 188,913
attributes.

I called:
weka.filters.supervised.attribute.AttributeSelection -S
weka.attributeSelection.Ranker -E
"weka.attributeSelection.ChiSquaredAttributeEval -B" -i input.arff -o
output-chi.arff -c last

By stepping through the code I found that it was not the chi-squared
computation that took ages (it is not that complex), but rather a method
call used for generating the relation name stated in the first line of the
ARFF file. All the indices of the features got enumerated and stitched
together into one long string (188,913 indices!).
It is quite handy to know what happened to the data by looking at the
relation name, but generating an ARFF file with features ranked by
chi-squared in a couple of minutes compared to several hours (or even a
day) makes quite a difference when performing lots of experiments.
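
To illustrate how stitching that many indices into one string can get slow, here is a minimal sketch. It is not Weka's code: the class and method names are hypothetical, and it assumes the cost comes from repeated String concatenation, which the stack trace alone does not confirm. It contrasts that naive approach with a StringBuilder, which appends into a single growing buffer.

```java
// Hypothetical illustration, not part of Weka.
final class IndexListSketch {

    // Naive variant: every += allocates a new String and copies everything
    // built so far, so the total work grows quadratically with the count.
    static String joinNaive(int[] indices) {
        String s = "";
        for (int i = 0; i < indices.length; i++) {
            if (i > 0) {
                s += ",";
            }
            s += (indices[i] + 1); // 1-based, as in ARFF attribute ranges
        }
        return s;
    }

    // Buffered variant: appends into one growing buffer, roughly linear work.
    static String joinBuffered(int[] indices) {
        StringBuilder sb = new StringBuilder(indices.length * 7);
        for (int i = 0; i < indices.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(indices[i] + 1);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int n = 20_000; // kept small so the naive variant finishes quickly
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) {
            idx[i] = i;
        }

        long t0 = System.nanoTime();
        String naive = joinNaive(idx);
        long t1 = System.nanoTime();
        String buffered = joinBuffered(idx);
        long t2 = System.nanoTime();

        System.out.println("same result: " + naive.equals(buffered));
        System.out.println("naive ms:    " + (t1 - t0) / 1_000_000);
        System.out.println("buffered ms: " + (t2 - t1) / 1_000_000);
    }
}
```

Both methods return the same string; only the amount of copying differs. Whether this is what actually happens inside Range.getRanges() would have to be checked against the Weka source.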

Since this behaviour is independent of the ChiSquaredAttributeEval class,
the phenomenon also happens when using Information Gain, etc., and might
also delay other classes.

Here is the call hierarchy from bottom to top (the nasty stuff happens in

Range.getRanges() line: 119
Remove.getAttributeIndices() line: 293
Remove.getOptions() line: 132
Remove(Filter).setOutputFormat(Instances) line: 128
Remove.setInputFormat(Instances) line: 179
AttributeSelection.SelectAttributes(Instances) line: 780
AttributeSelection.batchFinished() line: 330
Filter.filterFile(Filter, String[]) line: 722
AttributeSelection.main(String[]) line: 445

I bypassed this "delay" by modifying the method
Filter.setOutputFormat(Instances). I removed the line

String [] options = ((OptionHandler)this).getOptions();

and instead created the string array manually with

String [] options = {"", "", ""};
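
A less drastic variant of this workaround might keep the options in the relation name for small attribute sets and skip them only when the set is huge. The sketch below is purely hypothetical: RelationNameSketch, its parameters and the threshold are my own assumptions and do not exist in Weka. The point is that the expensive option enumeration sits behind a Supplier and is never invoked above the threshold.

```java
import java.util.function.Supplier;

// Hypothetical illustration, not part of Weka.
final class RelationNameSketch {

    // Builds "<base>-<filterName>[-<option>...]". The options supplier stands
    // in for an expensive getOptions() call and is only invoked when the
    // number of selected attributes is at most maxSelected.
    static String relationName(String base,
                               String filterName,
                               int numSelected,
                               int maxSelected,
                               Supplier<String[]> options) {
        StringBuilder sb = new StringBuilder(base).append('-').append(filterName);
        if (numSelected > maxSelected) {
            return sb.toString(); // skip the costly enumeration entirely
        }
        for (String opt : options.get()) {
            if (!opt.isEmpty()) {
                sb.append('-').append(opt);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Supplier<String[]> opts = () -> new String[] {"-R", "1,2,3"};
        // Small selection: options are included in the name.
        System.out.println(relationName("data", "Remove", 3, 1000, opts));
        // Huge selection: options are skipped, the supplier is never called.
        System.out.println(relationName("data", "Remove", 188_913, 1000, opts));
    }
}
```

With this shape the relation name stays informative for ordinary datasets, while very wide ones avoid the long string altogether.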

I hope that this hint is useful to others.


Wekalist mailing list