record count in leafs of random tree larger than training set
I am trying to understand the string representation of a Weka random tree and am coming across what seem like inconsistencies.
The training set has 1000 records (instances). Looking at the string, the number of instances in the leaves seems to add up to 1030. How is this possible? Am I misinterpreting the string somehow?
Moreover, when I partition the data using the features and values in the tree description, I get different counts from those shown in the tree description.
See the complete run description below.
Note the following:
`Total Number of Instances 1000`
while collecting all the counts from the leaves: (10/0),(1/0),(354/0),(18/1),(37/0),(11/0),(9/4),(5/0),(7/3),(5/0),(20/0),(1/0),(2/0),(168/0),(1/0),(145/0),(61/3),(3/1),(5/0),(44/13),(8/0),(10/2),(63/0),(8/3),(4/0)
leads to a total of 1030.
This training set was generated in sklearn:
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=4,
Here is the run description:
=== Run information ===
Scheme: weka.classifiers.trees.RandomTree -K 0 -M 1.0 -V 0.001 -S 1 -depth 5
Test mode: evaluate on training data
Re: record count in leafs of random tree larger than training set
The number preceding the forward slash gives the occupancy count (more precisely, the total weight of all the training instances in that leaf). The number following the forward slash gives the number of misclassified training instances (more precisely, the total weight of all the misclassified instances).
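If you add up the numbers preceding each forward slash only, you will find that the sum is indeed 1000. The numbers following the slashes add up to 30, which is where the extra 30 in your total of 1030 comes from.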
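As a quick check, here is a minimal Python sketch (Python only because the data came from sklearn) that parses the leaf annotations quoted in the question and sums the two numbers separately. The variable names are my own, and the regular expression assumes every leaf is written as `(weight/misclassified)`, as in the list above; leaves printed in any other form would not be matched.

```python
import re

# Leaf annotations copied from the question, each in "(weight/misclassified)" form.
leaves = ("(10/0),(1/0),(354/0),(18/1),(37/0),(11/0),(9/4),(5/0),(7/3),(5/0),"
          "(20/0),(1/0),(2/0),(168/0),(1/0),(145/0),(61/3),(3/1),(5/0),(44/13),"
          "(8/0),(10/2),(63/0),(8/3),(4/0)")

# Parse as floats because the values are weights, which need not be whole numbers.
pairs = [(float(a), float(b))
         for a, b in re.findall(r"\(([\d.]+)/([\d.]+)\)", leaves)]

occupancy = sum(a for a, _ in pairs)      # total weight of instances in the leaves
misclassified = sum(b for _, b in pairs)  # total weight of misclassified instances

print(occupancy, misclassified, occupancy + misclassified)
# 1000.0 30.0 1030.0
```

Only the first sum should be compared against `Total Number of Instances`; the misclassified instances are already included in the occupancy count of their leaf.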