Is there any approach for selecting the best clustering?


Is there any approach for selecting the best clustering?

eclipso
Hi.
The k-means algorithm is non-deterministic and can deliver very different results depending on the initialization of the centroids.
So, in real-world scenarios, it is not easy to choose the best k-means output.
I would like to know if there is an approach for selecting the best k-means output automatically.
Best regards.

_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
To subscribe, unsubscribe, etc., visit https://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html
Re: Is there any approach for selecting the best clustering?

Eibe Frank-2
Administrator
For a fixed number of clusters k in SimpleKMeans, you can just pick the solution that gives you the smallest sum of squared errors on the training data (shown in the output as "Within cluster sum of squared errors").
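This restart-and-select loop is easy to automate. Below is a minimal from-scratch sketch in plain Python (Lloyd's algorithm, not Weka's SimpleKMeans; the function names are made up for illustration): it runs k-means once per seed and keeps the solution with the smallest within-cluster sum of squared errors.

```python
import random

def kmeans(points, k, seed, iters=100):
    """Plain Lloyd's k-means on a list of tuples.
    Returns (centroids, within-cluster sum of squared errors)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # random initial centroids
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Recompute centroids (keep the old one if a cluster went empty).
        new = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl
               else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:                   # converged
            break
        centroids = new
    sse = sum(min(sum((a - b) ** 2 for a, b in zip(p, c))
                  for c in centroids)
              for p in points)
    return centroids, sse

def best_of_restarts(points, k, seeds):
    """Run k-means once per seed; keep the solution with the smallest SSE."""
    return min((kmeans(points, k, s) for s in seeds), key=lambda r: r[1])
```

With Weka itself you would do the same thing by varying -S and comparing the reported "Within cluster sum of squared errors", as in the runs below.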

Incidentally, on the iris data, there is a nice correspondence between that measure and the classification error from a classes-to-clusters evaluation:

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

$ java weka.Run .SimpleKMeans -t ~/datasets/UCI/iris.arff -S 1 -N 3 -c last


=== Clustering stats for training data ===


kMeans
======

Number of iterations: 8
Within cluster sum of squared errors: 6.9981140048267605

Initial starting points (random):

Cluster 0: 7.7,3,6.1,2.3
Cluster 1: 6.3,2.5,4.9,1.5
Cluster 2: 6.4,2.7,5.3,1.9

Missing values globally replaced with mean/mode

Final cluster centroids:
                           Cluster#
Attribute      Full Data          0          1          2
                 (150.0)     (39.0)     (50.0)     (61.0)
=========================================================
sepallength       5.8433     6.8462      5.006     5.8885
sepalwidth         3.054     3.0821      3.418     2.7377
petallength       3.7587     5.7026      1.464     4.3967
petalwidth        1.1987     2.0795      0.244      1.418


Clustered Instances

0       39 ( 26%)
1       50 ( 33%)
2       61 ( 41%)


Class attribute: class
Classes to Clusters:

  0  1  2  <-- assigned to cluster
  0 50  0 | Iris-setosa
  3  0 47 | Iris-versicolor
 36  0 14 | Iris-virginica

Cluster 0 <-- Iris-virginica
Cluster 1 <-- Iris-setosa
Cluster 2 <-- Iris-versicolor

Incorrectly clustered instances : 17.0 11.3333 %

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

$ java weka.Run .SimpleKMeans -t ~/datasets/UCI/iris.arff -S 100 -N 3 -c last


=== Clustering stats for training data ===


kMeans
======

Number of iterations: 4
Within cluster sum of squared errors: 10.908274989622527

Initial starting points (random):

Cluster 0: 6.4,3.2,5.3,2.3
Cluster 1: 5.4,3.4,1.5,0.4
Cluster 2: 4.4,3,1.3,0.2

Missing values globally replaced with mean/mode

Final cluster centroids:
                           Cluster#
Attribute      Full Data          0          1          2
                 (150.0)     (96.0)     (32.0)     (22.0)
=========================================================
sepallength       5.8433     6.3146     5.1781     4.7545
sepalwidth         3.054     2.8958     3.6313     2.9045
petallength       3.7587      4.974     1.4969     1.7455
petalwidth        1.1987     1.7031     0.2781     0.3364


Clustered Instances

0       96 ( 64%)
1       32 ( 21%)
2       22 ( 15%)


Class attribute: class
Classes to Clusters:

  0  1  2  <-- assigned to cluster
  0 32 18 | Iris-setosa
 46  0  4 | Iris-versicolor
 50  0  0 | Iris-virginica

Cluster 0 <-- Iris-virginica
Cluster 1 <-- Iris-setosa
Cluster 2 <-- Iris-versicolor

Incorrectly clustered instances : 64.0 42.6667 %

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
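As a side note, the "Incorrectly clustered instances" figures above can be reproduced from the two confusion matrices by brute force: try every cluster-to-class mapping and keep the one that makes the fewest errors. A small Python sketch (it assumes, as in these runs, that the number of clusters equals the number of classes; Weka's own evaluation also handles the general case):

```python
from itertools import permutations

def classes_to_clusters_error(confusion):
    """confusion[i][j] = number of instances of class i in cluster j.
    Tries every cluster-to-class assignment (fine for small k) and
    returns the minimum number of incorrectly clustered instances."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    best_correct = max(
        sum(confusion[cls][cluster] for cluster, cls in enumerate(perm))
        for perm in permutations(range(n))
    )
    return total - best_correct

# The two confusion matrices from the runs above
# (rows: setosa, versicolor, virginica; columns: clusters 0, 1, 2):
run1 = [[0, 50, 0], [3, 0, 47], [36, 0, 14]]
run2 = [[0, 32, 18], [46, 0, 4], [50, 0, 0]]

print(classes_to_clusters_error(run1))  # → 17
print(classes_to_clusters_error(run2))  # → 64
```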

The situation becomes less clear when you want to compare solutions that are based on different numbers of clusters.

Cheers,
Eibe

> On 27/08/2019, at 5:38 AM, Marcelino Borges <[hidden email]> wrote:
>
> Hi.
> The k-means algorithm is non deterministic and can deliver very different results, depending on the initialization of the centroids.
> So, in real world scenarios, it is not easy to choose the best output of the k-means.
> I would like to know if there is some approach for selecting the best k-means output in an automatic way.
> Best regards.
