Random Forest Proximity

Marcus Müller
Dear WEKA-Experts,

According to Leo Breiman, a proximity (or similarity) measure of the random forest for two different instances is the number of trees in which they end up in the same node, divided by the total number of trees. Is there any built-in functionality like this in WEKA to evaluate the similarity of two instances?

Thank you very much,
Marcus  

_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
List info and subscription status: https://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: Random Forest Proximity

Eibe Frank-2
Administrator
By “in the same node” you probably mean “in the same leaf node”? No, this is not possible in WEKA without writing some code.
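
To make the definition concrete, here is a minimal sketch of the leaf-based proximity in plain Java. It assumes you have already obtained, by whatever means, the leaf that each instance reaches in each tree; the leafIds array and the class name are hypothetical and not part of WEKA.

```java
class ForestProximity {
    /**
     * Fraction of trees in which instances a and b reach the same leaf.
     * leafIds[t][i] is the id of the leaf that instance i reaches in tree t
     * (hypothetical input; WEKA does not hand this out directly).
     */
    static double proximity(int[][] leafIds, int a, int b) {
        int same = 0;
        for (int[] tree : leafIds) {
            if (tree[a] == tree[b]) {
                same++;
            }
        }
        return (double) same / leafIds.length;
    }

    public static void main(String[] args) {
        // Four trees, three instances; the values are made-up leaf ids.
        int[][] leafIds = {
            {0, 0, 1},
            {2, 2, 2},
            {1, 0, 0},
            {3, 3, 1}
        };
        System.out.println(proximity(leafIds, 0, 1)); // same leaf in 3 of 4 trees
        System.out.println(proximity(leafIds, 0, 2)); // same leaf in 1 of 4 trees
    }
}
```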

You can use the PartitionMembership filter with RandomForest as the partition generator to get a membership indicator vector for each input instance. This vector will contain one attribute value for each node in the RandomForest (leaves and internal nodes!). The attribute value will be 1 if the corresponding node contains the input instance and 0 otherwise. (This is assuming standard single-instance input data and not multi-instance data.) The vector will be represented as a SparseInstance to save space.

You could then use this for clustering, etc., by applying a distance function such as Manhattan distance to compare the membership vectors.
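
As a rough illustration of that comparison, the sketch below computes the Manhattan distance between two 0/1 membership vectors in plain Java. Each vector is represented as a sorted array of the node indices whose value is 1, which is only a stand-in for a sparse instance; the class and method names are made up for the example and are not WEKA API.

```java
class MembershipDistance {
    /**
     * Manhattan distance between two 0/1 vectors, each given as a sorted
     * array of the indices that hold a 1. For binary vectors this equals
     * the number of indices set in exactly one of the two vectors.
     */
    static int manhattan(int[] onesA, int[] onesB) {
        int i = 0, j = 0, diff = 0;
        while (i < onesA.length && j < onesB.length) {
            if (onesA[i] == onesB[j]) {        // node contains both instances
                i++;
                j++;
            } else if (onesA[i] < onesB[j]) {  // node contains only A
                diff++;
                i++;
            } else {                           // node contains only B
                diff++;
                j++;
            }
        }
        // Whatever is left over in either array is unmatched.
        return diff + (onesA.length - i) + (onesB.length - j);
    }

    public static void main(String[] args) {
        int[] a = {0, 1, 3, 7, 8};   // made-up node indices
        int[] b = {0, 1, 4, 7, 9};
        System.out.println(manhattan(a, b)); // nodes 3, 8, 4, 9 differ
    }
}
```

Because the vectors include internal nodes, two instances that separate late on their way down a tree share more nodes, and so end up closer, than two that separate at the root.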

Command-line example usage of the filter:

java -cp ~/weka-3-9-1/weka.jar weka.Run .PartitionMembership -W .RandomForest -i ~/datasets/UCI/iris.arff -c last

If you are willing to write some code, you can subclass RandomTree and change the relevant methods so that it only considers leaf nodes when generating the membership vectors.
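
The idea of a leaf-only membership vector can be sketched on a toy tree structure. This is not a RandomTree subclass and does not touch WEKA's internals: Node, indexForest, membership and the demo forest are all hypothetical names chosen for the example, and the split rule (go left if the value is below the threshold) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

class LeafMembership {
    /** Toy binary tree node; NOT WEKA's RandomTree internals. */
    static final class Node {
        final int attr;        // attribute tested at an internal node
        final double split;    // go left if x[attr] < split
        final Node left, right;
        int leafIndex = -1;    // position in the membership vector (leaves only)

        Node(int attr, double split, Node left, Node right) {
            this.attr = attr; this.split = split; this.left = left; this.right = right;
        }
        static Node leaf() { return new Node(-1, 0.0, null, null); }
        boolean isLeaf() { return left == null; }
    }

    /** Numbers the leaves of one tree from 'next' on; returns the next free index. */
    static int indexLeaves(Node n, int next) {
        if (n.isLeaf()) { n.leafIndex = next; return next + 1; }
        return indexLeaves(n.right, indexLeaves(n.left, next));
    }

    /** Numbers all leaves in the forest; returns the total number of leaves. */
    static int indexForest(List<Node> forest) {
        int next = 0;
        for (Node root : forest) next = indexLeaves(root, next);
        return next;
    }

    /** Leaf-only membership vector: exactly one 1 per tree. */
    static int[] membership(List<Node> forest, int numLeaves, double[] x) {
        int[] v = new int[numLeaves];
        for (Node n : forest) {
            while (!n.isLeaf()) n = x[n.attr] < n.split ? n.left : n.right;
            v[n.leafIndex] = 1;
        }
        return v;
    }

    /** With one 1 per tree, proximity = 1 - manhattan / (2 * numTrees). */
    static double proximity(int[] a, int[] b, int numTrees) {
        int d = 0;
        for (int i = 0; i < a.length; i++) d += Math.abs(a[i] - b[i]);
        return 1.0 - d / (2.0 * numTrees);
    }

    /** Two small hand-built trees with five leaves in total. */
    static List<Node> demoForest() {
        List<Node> forest = new ArrayList<>();
        forest.add(new Node(0, 0.5, Node.leaf(), Node.leaf()));
        forest.add(new Node(1, 2.0, Node.leaf(),
                new Node(0, 1.5, Node.leaf(), Node.leaf())));
        return forest;
    }

    public static void main(String[] args) {
        List<Node> forest = demoForest();
        int numLeaves = indexForest(forest);
        int[] a = membership(forest, numLeaves, new double[]{0.2, 3.0});
        int[] b = membership(forest, numLeaves, new double[]{0.9, 3.0});
        System.out.println(proximity(a, b, forest.size())); // same leaf in 1 of 2 trees
    }
}
```

Once only leaves are counted, every tree contributes exactly one 1 per instance, so each tree in which two instances land in different leaves adds 2 to the Manhattan distance. That is why the proximity asked about can be recovered from the distance as 1 - d / (2 * numTrees).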

Cheers,
Eibe

> On 7/08/2017, at 11:46 AM, Marcus Müller <[hidden email]> wrote:

Re: Random Forest Proximity

Marcus Müller-2
Hello Eibe,

Thank you very much for your answer. If I understand you correctly, the PartitionMembership filter produces a sparse vector indicating whether an instance is present in each node of the random forest, internal nodes included. Did I understand correctly that, without modifying the RandomTree class, there is no way to tell whether a node in the PartitionMembership output is an internal node or a leaf node?

Thank you,
Marcus

2017-08-08 7:38 GMT+02:00 Eibe Frank <[hidden email]>:

Re: Random Forest Proximity

Eibe Frank-2
Administrator
Yes, that’s correct.

Cheers,
Eibe

> On 13/08/2017, at 10:08 PM, Marcus Müller <[hidden email]> wrote: