Clustering method

valerio jus
Dear all, 

I'm working on data clustering using the "Canopy" clusterer. My issue is how to determine the numerical values of T1 and T2. I read the associated article:

A. McCallum, K. Nigam, L.H. Ungar: Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 169-178, 2000.

but could not figure out why the default values of T1 and T2 are set to -1.25 and -1.0, respectively.

In sum, my questions are:

1- What range can one select from when optimizing the values of T1 and T2?

2- Why T1 = -1.25 and T2 = -1.0?

3- In general, how should one select the values of T1 and T2? Is there a strategy for that, or is it a matter of exploration?


Cheers, 
Valerio



_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
List info and subscription status: https://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: Clustering method

Mark Hall
The defaults for T1 and T2 are explained in the help documentation for the clusterer. In short, the default of -1.0 for T2 indicates that the algorithm should use an (admittedly hacky) heuristic based on attribute standard deviation to set this value. User-supplied values > 0 are taken as-is. For T1, you can either supply a value > 0, which like T2 is used as-is, or supply a value < 0, in which case its absolute value is used as a multiplier of T2 to set T1.
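
The rule described above can be sketched in a few lines of Python. This is only an illustration of the convention, not Weka's source code: the function name `resolve_thresholds` and the `heuristic_t2` argument (a stand-in for whatever the standard-deviation heuristic would return) are hypothetical, and rejecting 0 is an assumption, since the explanation only covers values above and below zero.

```python
def resolve_thresholds(t1, t2, heuristic_t2):
    """Turn user-supplied T1/T2 settings into the distances actually used.

    heuristic_t2 is a hypothetical placeholder for the value the
    standard-deviation heuristic would produce; it is not computed here.
    """
    if t1 == 0 or t2 == 0:
        # Assumption: the thread does not say how 0 is handled.
        raise ValueError("T1 and T2 must be non-zero")

    # T2 > 0 is taken as-is; T2 < 0 means "use the heuristic".
    resolved_t2 = t2 if t2 > 0 else heuristic_t2

    # T1 > 0 is taken as-is; T1 < 0 means |T1| is a multiplier of T2.
    resolved_t1 = t1 if t1 > 0 else abs(t1) * resolved_t2

    return resolved_t1, resolved_t2


# Defaults (-1.25, -1.0) with a made-up heuristic value of 2.0 for T2.
print(resolve_thresholds(-1.25, -1.0, heuristic_t2=2.0))
```

Read this way, the default T1 = -1.25 simply means "make T1 25% larger than whatever T2 turns out to be".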

There are no real guidelines on how to set T2 and T1. You should just experiment to see how different values affect the result. One thing you could try is to wrap Canopy in the MakeDensityBasedClusterer. Then you can take a look at the log likelihood of some test data (using a percentage split or separate test set) in order to see the effects of varying the values of T2 and T1.
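
To get a feel for how the thresholds change the outcome, here is a minimal Python sketch of the basic canopy procedure from the McCallum et al. paper: pick a point, collect everything within the loose threshold T1, discard everything within the tight threshold T2, and repeat. It is not Weka's implementation, and the Euclidean distance, the function name `canopies`, and the toy points are assumptions chosen for illustration.

```python
import math


def canopies(points, t1, t2):
    """Basic canopy clustering; returns a list of (center, members) pairs.

    t1 is the loose threshold, t2 the tight one (t1 >= t2 > 0).
    Euclidean distance is an assumption made for this sketch.
    """
    if not (t1 >= t2 > 0):
        raise ValueError("need t1 >= t2 > 0")

    remaining = list(points)
    result = []
    while remaining:
        center = remaining[0]
        # Every point within T1 joins this canopy (canopies may overlap).
        members = [p for p in points if math.dist(center, p) <= t1]
        result.append((center, members))
        # Points within T2 can no longer become centers themselves.
        remaining = [p for p in remaining if math.dist(center, p) > t2]
    return result


# Toy data (made up): vary T2 and keep T1 = 1.25 * T2, as with the default.
toy_points = [(0.0, 0.0), (0.5, 0.0), (1.2, 0.0), (5.0, 5.0)]
for t2 in (1.0, 2.0):
    print(t2, len(canopies(toy_points, 1.25 * t2, t2)))
```

On this toy data the smaller T2 produces more canopies, which is the kind of effect to look for when experimenting with real data.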

Cheers,
Mark.

Re: Clustering method

valerio jus
This is quite helpful. I've followed your advice and clustered the glass data by wrapping the "Canopy" method in the "MakeDensityBasedClusterer" clusterer. I got a log likelihood of 0.88555.

How should I interpret this log likelihood? Is it a good result?

Cheers, 
Valerio

Re: Clustering method

Eibe Frank-2
Administrator
Mark suggested this as a way to choose an appropriate model. A single log-likelihood value is not good or bad on its own; the idea is to compare different models and pick the one that maximises this score.
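
The comparison can be sketched as a small search loop in Python. Everything named here is hypothetical: `select_best` is an illustrative helper, and `evaluate` stands for whatever procedure trains the density-wrapped clusterer with the given thresholds and returns the log likelihood on held-out data. The toy scoring function below is a made-up stand-in, not a real result.

```python
from itertools import product


def select_best(t1_multipliers, t2_values, evaluate):
    """Try every (multiplier, T2) pair and keep the highest score.

    evaluate(t1, t2) is a hypothetical callback that should return the
    held-out log likelihood for those thresholds (higher is better).
    Returns (best_score, t1, t2), or None if there are no candidates.
    """
    best = None
    for multiplier, t2 in product(t1_multipliers, t2_values):
        t1 = multiplier * t2  # same relationship as a negative T1 setting
        score = evaluate(t1, t2)
        if best is None or score > best[0]:
            best = (score, t1, t2)
    return best


def toy_score(t1, t2):
    # Made-up stand-in for a real held-out log likelihood.
    return -((t2 - 1.5) ** 2) - (t1 - 2.0) ** 2


print(select_best([1.25, 2.0], [0.5, 1.0, 1.5], toy_score))
```

The scores are only meaningful relative to each other, so they should all come from the same test data.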

Cheers,
Eibe

Re: Clustering method

valerio jus
Thanks to both of you. The idea is quite clear now.

Cheers, 
Valerio
