A note on something that is not well-known about the algorithm: When you have numeric attributes, the most expensive part is potentially the search for a split point equalling a numeric value that actually occurs in the training set. With WEKA’s implementation (and the original C4.5 implementation) this requires a scan through the entire training set for each node of the tree that exhibits a split on a numeric attribute. Depending on the data source, the number of nodes in the tree can be O(n), where n is the number of training instances, making the time complexity for this part O(n^2). That’s why recent versions of J48 have the "-doNotMakeSplitPointActualValue” flag that prevents this expensive search (which does not appear to have a strong raison d’être in the first place).
This post has NOT been accepted by the mailing list yet.
For a given data set I found that WEKA's j48 results in a different attribute as the first split (at the root) when J48 is running without pruning, than what it selects as the first split when pruning is turned off.
In one case it selected a numeric attribute (when pruning was turned off), hence I was wondering if there is any random element to the search of split point and thus the infoGain computed was might be different.