Dear all,
I'm working on data clustering using the "Canopy" clusterer. My issue is how to determine the numerical values of T1 and T2. I read the associated article:

A. McCallum, K. Nigam, L.H. Ungar: Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching. In: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, 169-178, 2000.

but could not figure out why the default values of T1 and T2 were set to -1.25 and -1.0, respectively.

In sum, my questions are:

1. What is the range one can select from when optimizing the values of T1 and T2?
2. Why T1 = -1.25 and T2 = -1.0?
3. In general, how should the values of T1 and T2 be selected? Is there a strategy for that, or is it a matter of exploration?

Cheers,
Valerio

_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
List info and subscription status: https://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html
The defaults for T1 and T2 are explained in the help documentation for the clusterer. In short, the default of -1.0 for T2 indicates that the algorithm should use an (admittedly hacky) heuristic based on attribute standard deviation to set this value. User-supplied values > 0 are taken as-is. For T1, you can either supply a value > 0, which, like T2, is used as-is, or supply a value < 0, whose absolute value is used as a multiplier of T2 to set T1.
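The rule described above can be written out as a small sketch. This is not Weka's code: the function name `resolve_thresholds` and the parameter `heuristic_t2` are hypothetical, and the standard-deviation heuristic itself is not reproduced here, only passed in as a number. It assumes the negative defaults (-1.0 for T2, -1.25 for T1) given in the clusterer's option help.

```python
def resolve_thresholds(user_t2, user_t1, heuristic_t2):
    """Return (t2, t1) following the rule described in the reply.

    heuristic_t2 is a hypothetical stand-in for the value the clusterer
    would derive from attribute standard deviations; it is supplied by the
    caller rather than computed here.
    """
    # T2: a positive user value is used as-is, otherwise fall back to
    # the heuristic value. (A value of exactly 0 is not covered by the
    # description; this sketch treats it like a negative value.)
    t2 = user_t2 if user_t2 > 0 else heuristic_t2
    # T1: a positive user value is used as-is, otherwise its absolute
    # value acts as a multiplier of T2.
    t1 = user_t1 if user_t1 > 0 else abs(user_t1) * t2
    return t2, t1


# With the defaults and an assumed heuristic value of 2.0 for T2,
# T1 becomes 1.25 times T2.
default_t2, default_t1 = resolve_thresholds(-1.0, -1.25, heuristic_t2=2.0)
```

Under this reading, the default T1 of -1.25 is not a distance at all; it only says that T1 should be 25% larger than whatever T2 turns out to be.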
There are no real guidelines on how to set T2 and T1. You should just experiment to see how different values affect the result. One thing you could try is to wrap Canopy in the MakeDensityBasedClusterer. Then you can look at the log likelihood of some test data (using a percentage split or a separate test set) to see the effects of varying the values of T2 and T1.

Cheers,
Mark.
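To make the idea of scoring a clustering by held-out log likelihood concrete, here is a minimal self-contained sketch. It does not call Weka; it is a simplified stand-in that assumes the density wrapper fits one normal distribution per attribute within each cluster and weights clusters by their size. All function names and the toy data are made up for illustration, and the cluster labels are supplied by hand rather than produced by Canopy.

```python
import math


def fit_density(points, labels, min_std=1e-6):
    """Fit one diagonal Gaussian per cluster, with priors from cluster sizes."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    model = []
    for members in clusters.values():
        n = len(members)
        dims = range(len(members[0]))
        means = [sum(m[d] for m in members) / n for d in dims]
        stds = [
            max(math.sqrt(sum((m[d] - means[d]) ** 2 for m in members) / n), min_std)
            for d in dims
        ]
        model.append((n / len(points), means, stds))
    return model


def log_density(model, x):
    """Log of the mixture density at x, computed with log-sum-exp."""
    terms = []
    for prior, means, stds in model:
        lp = math.log(prior)
        for xd, mu, sd in zip(x, means, stds):
            lp += -0.5 * math.log(2 * math.pi * sd * sd) - (xd - mu) ** 2 / (2 * sd * sd)
        terms.append(lp)
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def mean_log_likelihood(model, test_points):
    """Average log density over held-out points; higher means a better fit."""
    return sum(log_density(model, x) for x in test_points) / len(test_points)


# Toy data (illustrative only): two well-separated groups in one dimension.
train = [(0.0,), (0.2,), (0.4,), (10.0,), (10.2,), (10.4,)]
held_out = [(0.1,), (10.3,)]

# Two candidate clusterings of the training data, as different threshold
# settings might produce: two clusters versus a single cluster.
score_two = mean_log_likelihood(fit_density(train, [0, 0, 0, 1, 1, 1]), held_out)
score_one = mean_log_likelihood(fit_density(train, [0, 0, 0, 0, 0, 0]), held_out)
best = "two clusters" if score_two > score_one else "one cluster"
```

The score of a single setting says little in isolation; what matters is how it ranks against the scores obtained from other T2 and T1 settings on the same held-out data.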
This is quite helpful. I followed your advice and clustered the glass data by wrapping the "Canopy" method in the "MakeDensityBasedClusterer" clusterer. I got Log likelihood = 0.88555.

How should I interpret this log likelihood? Is it a good result?

Cheers,
Valerio
Mark suggested this as a way to choose an appropriate model. The idea is to compare different models to maximise this score.
Cheers,
Eibe
Thanks to both of you. The idea is quite clear now.

Cheers,
Valerio