RandomForest is a simple wrapper that applies Bagging with RandomTree as the base learner. RandomTree itself does not form bootstrap samples of the data; Bagging does that part.
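In code, that division of labour looks roughly like this. This is a minimal Python sketch of the Bagging side (bootstrap sampling plus majority voting), not Weka's actual Java implementation; the names are hypothetical:

```python
import random

def bootstrap_sample(data, rng):
    """Bagging's job: draw n instances with replacement from an n-instance dataset."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

def bagging_predict(trees, x):
    """Combine the base learners' predictions by majority vote (classification case).

    Each element of `trees` stands in for a RandomTree grown on one
    bootstrap sample; here they are just callables.
    """
    votes = [tree(x) for tree in trees]
    return max(set(votes), key=votes.count)
```

Each base tree only ever sees the bootstrap sample it was handed, which is why the sampling logic lives in Bagging rather than in RandomTree.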

When performing classification, RandomTree grows a tree by selecting splits based on information gain (just like Quinlan's original ID3 tree learner). For regression problems, RandomTree selects the split that minimises squared error, computed locally for the data at the node being split, using the mean of each candidate subset as its predictor.
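As a rough illustration of the two criteria, here is a Python sketch (hypothetical helper functions, not Weka code): information gain for the classification case, and the sum of within-subset squared errors, with each subset's mean as its predictor, for the regression case:

```python
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n)
                for c in (labels.count(v) for v in set(labels)))

def info_gain(parent_labels, subsets):
    """Information gain: parent entropy minus the weighted entropy of the subsets."""
    n = len(parent_labels)
    return entropy(parent_labels) - sum(len(s) / n * entropy(s) for s in subsets)

def squared_error(values):
    """Squared error of numeric targets, using their mean as the predictor."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def split_squared_error(subsets):
    """Regression split score: sum of within-subset squared errors (lower is better)."""
    return sum(squared_error(s) for s in subsets if s)
```

A pure binary split, e.g. `info_gain([0, 0, 1, 1], [[0, 0], [1, 1]])`, achieves the maximum gain of 1 bit, and `split_squared_error([[1, 1], [3, 3]])` is zero because each subset's mean predicts it perfectly.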

RandomTree essentially works the same way as REPTree, assuming REPTree is configured to build unpruned trees. The difference is that RandomTree only considers a subset of randomly chosen attributes for splitting at each node (hence the name RandomTree). Computationally, REPTree avoids repeated sorting by storing sort orders in an internal data structure. RandomTree does not do that, because it does not make sense to do that separately for each individual tree in a RandomForest: it is too expensive in that scenario.
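The attribute subsampling at each node can be sketched like this (an illustrative Python fragment, not Weka code); `score` stands in for whichever criterion applies, information gain or squared-error reduction:

```python
import random

def choose_split(data, attributes, k, score, rng):
    """Pick the best split among k randomly chosen candidate attributes.

    `score(attr, data)` is an assumed callable returning the split quality
    for `attr` on `data` (higher is better). Sampling a fresh subset of
    attributes at every node is what distinguishes RandomTree from an
    ordinary unpruned tree learner.
    """
    candidates = rng.sample(attributes, min(k, len(attributes)))
    return max(candidates, key=lambda a: score(a, data))
```

With `k` equal to the total number of attributes this reduces to ordinary greedy split selection; smaller `k` injects the randomness that decorrelates the trees in the forest.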

RandomTree performs multi-way splits on nominal attributes, with one branch for each value, whereas CART uses binary splits for nominal attributes. Another difference is the treatment of missing values. RandomTree uses C4.5's method of fractional instances to deal with missing values: information gain and squared error are computed by taking the fractional data into account. CART instead uses surrogate splits to deal with missing values.
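A simplified Python sketch of the fractional-instance idea, assuming instances carry weights and `None` marks a missing value (illustrative only, not C4.5's exact bookkeeping):

```python
def split_with_fractions(instances, attr, values):
    """Distribute weighted instances over the branches of a multi-way split.

    `instances` is a list of (attribute_dict, weight) pairs. An instance
    with a missing value for `attr` is sent down every branch with a
    fractional weight proportional to that branch's share of the
    known-valued instances, C4.5-style.
    """
    known = [(x, w) for x, w in instances if x[attr] is not None]
    total = sum(w for _, w in known)
    branches = {v: [] for v in values}
    for x, w in instances:
        if x[attr] is not None:
            branches[x[attr]].append((x, w))
        else:
            for v in values:
                share = sum(w2 for x2, w2 in known if x2[attr] == v) / total
                if share > 0:
                    branches[v].append((x, w * share))
    return branches
```

The fractional weights then enter the information gain and squared-error computations in place of whole-instance counts, so a missing value never forces an instance to be discarded.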

If I remember correctly, CART also uses squared error for split selection in the regression case (but you should check that). Assuming that's true, and assuming your data is for a regression problem, has only numeric attributes and no missing values, the CART tree you grow is unpruned, and you configure both CART and RandomTree to consider the same number of randomly chosen attributes for splitting at each node, you should get very similar results on average (obviously, there will be a random element). You may also need to play with the minimum leaf size parameter.

You could actually do the comparison in WEKA, by installing the RPlugin package and running R's rpart or randomForest through the MLRClassifier that comes with the RPlugin.

Cheers,

Eibe

Dear Sir,

I'm YOU Weizhen, a student in France. I have been using Weka for a long time; it is a really useful tool for data mining. Now I have a question about a classifier called RandomTree (in the path: classifiers.trees.RandomTree). I haven't found much introductory material about how it works. What I have found includes:

On page 404, Table 10.5 says 'Construct a tree that considers a given number of random features at each node';

Page 407 says 'Trees built by RandomTree chooses a test based on a given number of random features at each node, performing no pruning.';

Page 416 says 'The random forest scheme was mentioned on page 407. It is really a metalearner, but Weka includes it among the decision tree methods because it is hardwired to a particular classifier, RandomTree.'.

Sir, I want to know whether the RandomTree is built from a bootstrap sample. If so, it would be the same as the CART used in a random forest. If possible, can you tell me other characteristics of RandomTree that differ from CART? You could also share a document about this with me.