record count in leafs of random tree larger than training set

yosi.hammer
Hi all,

I am trying to understand the string representation of a Weka random tree and am coming across what seem like inconsistencies.

The training set has 1000 records (instances). Looking at the string, the numbers of instances in the leaves seem to add up to 1030. How is this possible? Am I misinterpreting the string somehow?

Moreover, when I partition the data using the features and values in the tree description, I get different counts from those shown in the tree description.

See the complete run description below.

Note the following:

`Total Number of Instances             1000`

while collecting all the counts from the leaves: (10/0),(1/0),(354/0),(18/1),(37/0),(11/0),(9/4),(5/0),(7/3),(5/0),(20/0),(1/0),(2/0),(168/0),(1/0),(145/0),(61/3),(3/1),(5/0),(44/13),(8/0),(10/2),(63/0),(8/3),(4/0)

leads to a total of 1030.

This training set was generated in sklearn:

from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

Here is the run description:

=== Run information ===

Scheme:       weka.classifiers.trees.RandomTree -K 0 -M 1.0 -V 0.001 -S 1 -depth 5
Relation:     test-data
Instances:    1000
Attributes:   5
              feature1
              feature2
              feature3
              feature4
              class
Test mode:    evaluate on training data

=== Classifier model (full training set) ===


RandomTree
==========

feature2 < -0.27
|   feature2 < -0.61
|   |   feature3 < 1.09
|   |   |   feature2 < -2.41
|   |   |   |   feature2 < -2.45 : 0 (10/0)
|   |   |   |   feature2 >= -2.45 : 1 (1/0)
|   |   |   feature2 >= -2.41
|   |   |   |   feature2 < -0.7 : 0 (354/0)
|   |   |   |   feature2 >= -0.7 : 0 (18/1)
|   |   feature3 >= 1.09
|   |   |   feature2 < -0.94 : 0 (37/0)
|   |   |   feature2 >= -0.94
|   |   |   |   feature1 < -0.02 : 0 (11/0)
|   |   |   |   feature1 >= -0.02 : 0 (9/4)
|   feature2 >= -0.61
|   |   feature3 < -0.34
|   |   |   feature1 < 1.19 : 1 (5/0)
|   |   |   feature1 >= 1.19
|   |   |   |   feature2 < -0.39 : 0 (7/3)
|   |   |   |   feature2 >= -0.39 : 0 (5/0)
|   |   feature3 >= -0.34
|   |   |   feature2 < -0.32 : 0 (20/0)
|   |   |   feature2 >= -0.32
|   |   |   |   feature2 < -0.3 : 1 (1/0)
|   |   |   |   feature2 >= -0.3 : 0 (2/0)
feature2 >= -0.27
|   feature1 < 1.19
|   |   feature3 < -0.11 : 1 (168/0)
|   |   feature3 >= -0.11
|   |   |   feature3 < -0.1 : 0 (1/0)
|   |   |   feature3 >= -0.1
|   |   |   |   feature4 < 0.59 : 1 (145/0)
|   |   |   |   feature4 >= 0.59 : 1 (61/3)
|   feature1 >= 1.19
|   |   feature2 < 0.82
|   |   |   feature2 < -0.18
|   |   |   |   feature2 < -0.21 : 0 (3/1)
|   |   |   |   feature2 >= -0.21 : 0 (5/0)
|   |   |   feature2 >= -0.18
|   |   |   |   feature1 < 2.28 : 1 (44/13)
|   |   |   |   feature1 >= 2.28 : 0 (8/0)
|   |   feature2 >= 0.82
|   |   |   feature1 < 2.67
|   |   |   |   feature1 < 1.33 : 1 (10/2)
|   |   |   |   feature1 >= 1.33 : 1 (63/0)
|   |   |   feature1 >= 2.67
|   |   |   |   feature1 < 2.97 : 0 (8/3)
|   |   |   |   feature1 >= 2.97 : 1 (4/0)

Size of the tree : 49
Max depth of tree: 5

Time taken to build model: 0.05 seconds

=== Evaluation on training set ===

Time taken to test model on training data: 0.03 seconds

=== Summary ===

Correctly Classified Instances         970               97      %
Incorrectly Classified Instances        30                3      %
Kappa statistic                          0.94  
Mean absolute error                      0.0421
Root mean squared error                  0.145
Relative absolute error                  8.4142 %
Root relative squared error             29.0073 %
Total Number of Instances             1000    

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.964    0.024    0.976      0.964    0.970      0.940    0.997     0.996     0
                 0.976    0.036    0.964      0.976    0.970      0.940    0.997     0.995     1
Weighted Avg.    0.970    0.030    0.970      0.970    0.970      0.940    0.997     0.996    

=== Confusion Matrix ===

   a   b   <-- classified as
 486  18 |   a = 0
  12 484 |   b = 1
_______________________________________________
Wekalist mailing list -- [hidden email]
Send posts to [hidden email]
To unsubscribe send an email to [hidden email]
To subscribe, unsubscribe, etc., visit https://list.waikato.ac.nz/postorius/lists/wekalist.list.waikato.ac.nz
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: record count in leafs of random tree larger than training set

Eibe Frank-3
The number preceding the forward slash gives the occupancy count (more precisely, the total weight of all the training instances in that leaf). The number following the forward slash gives the number of misclassified training instances (more precisely, the total weight of all the misclassified instances).

If you add up the numbers preceding each forward slash only, you will find that the sum is indeed 1000.

Cheers,
Eibe
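
The arithmetic in the reply above can be checked with a short script. This is only a minimal sketch: the leaf annotations are copied from the tree in the original post, and the variable names (`leaves`, `pairs`, `occupancy`, `misclassified`) are illustrative, not part of Weka. Values are parsed as floats because, as noted above, they are weights and need not be whole numbers in general.

```python
import re

# Leaf annotations copied from the tree printed in the original post.
leaves = ("(10/0),(1/0),(354/0),(18/1),(37/0),(11/0),(9/4),(5/0),(7/3),(5/0),"
          "(20/0),(1/0),(2/0),(168/0),(1/0),(145/0),(61/3),(3/1),(5/0),(44/13),"
          "(8/0),(10/2),(63/0),(8/3),(4/0)")

# Each pair is (total weight in the leaf, weight misclassified in the leaf).
pairs = [(float(n), float(m))
         for n, m in re.findall(r"\(([\d.]+)/([\d.]+)\)", leaves)]

occupancy = sum(n for n, _ in pairs)       # numbers before the slash
misclassified = sum(m for _, m in pairs)   # numbers after the slash

print(occupancy)      # 1000.0
print(misclassified)  # 30.0
```

The first sum equals the 1000 training instances, and the second equals the 30 incorrectly classified instances reported in the evaluation summary. The figure of 1030 comes from adding both sums together.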
