ClassificationViaClustering - More Classes Than Clusters

ClassificationViaClustering - More Classes Than Clusters

TobiasH
Hello colleagues,

I am a rookie in the data mining field and have trouble understanding
the results of the /ClassificationViaClustering/ classifier.

As far as I understand, it uses the same routine as the /"Classes to clusters
evaluation"/ in the Cluster panel, meaning that the majority class of each
cluster is determined (which also gives the smallest error).
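
A minimal sketch of the majority-class idea described above, in plain Python rather than Weka code. The function name `majority_class_mapping` and the toy data are made up for illustration; Weka's actual mapping routine may differ in detail (for example in how it handles ties or classes claimed by several clusters).

```python
from collections import Counter


def majority_class_mapping(cluster_ids, class_labels):
    """Map each cluster id to the most frequent class among its members."""
    counts = {}
    for cluster, label in zip(cluster_ids, class_labels):
        counts.setdefault(cluster, Counter())[label] += 1
    return {cluster: c.most_common(1)[0][0] for cluster, c in counts.items()}


# Toy data (hypothetical, not from the dataset in this post):
# 2 clusters, 4 classes, every instance has a cluster id and a class label.
clusters = [0, 0, 0, 0, 0, 1, 1, 1, 1]
classes = ["a", "a", "b", "c", "d", "c", "c", "a", "d"]

mapping = majority_class_mapping(clusters, classes)
print(mapping)  # {0: 'a', 1: 'c'}
```

With two clusters, at most two of the four classes can ever be predicted by a single model; here classes "b" and "d" get no cluster.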

So, I have a dataset with 5115 instances, 29 attributes incl. the nominal
class-attribute with 4 different values (classes).

I use /SimpleKMeans/ in /ClassificationViaClustering/ with /numClusters/ = 2
(don't ask why... okay, in case someone is interested: I want to run
k-Means and other clusterers with k = 1,2,...,20 on different datasets, where
the class is binned into 4, 8, 16, ... classes).

All four classes are present in both (k = 2) clusters.

The output that I don't understand is:
-----------------------
Clusters to classes mapping:
  1. Cluster: '(-inf-12.495]' (1)
  2. Cluster: '(16.05-20.25]' (3)

Classes to clusters mapping:
  1. Class ('(-inf-12.495]'): 1. Cluster
  2. Class ('(12.495-16.05]'): no cluster
  3. Class ('(16.05-20.25]'): 2. Cluster
  4. Class ('(20.25-inf)'): no cluster

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances        1504               29.4037 %
Incorrectly Classified Instances      3611               70.5963 %
Kappa statistic                          0.058
Mean absolute error                      0.353
Root mean squared error                  0.5941
Relative absolute error                 94.1288 %
Root relative squared error            137.207  %
Total Number of Instances             5115    

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0,471    0,357    0,306      0,471    0,371      0,102    0,557     0,276     '(-inf-12.495]'
                 0,000    0,000    ?          0,000    ?          ?        0,500     0,248     '(12.495-16.05]'
                 0,215    0,139    0,341      0,215    0,264      0,090    0,538     0,270     '(16.05-20.25]'
                 0,487    0,446    0,268      0,487    0,346      0,036    0,520     0,259     '(20.25-inf)'
Weighted Avg.    0,294    0,236    ?          0,294    ?          ?        0,529     0,264

=== Confusion Matrix ===

   a   b   c   d   <-- classified as
 603   0 134 542 |   a = '(-inf-12.495]'
 515   0 179 577 |   b = '(12.495-16.05]'
 416   0 276 589 |   c = '(16.05-20.25]'
 438   0 221 625 |   d = '(20.25-inf)'

--------------------------------------------
So okay, Cluster 1 is assigned to class "a" ('(-inf-12.495]') and Cluster 2 to
class "c".
My understanding is that all 3424 instances of Cluster 1 should now be
classified as "a" and all 1691 instances of Cluster 2 should be classified
as class "c", right?
But as you can see, three columns of the confusion matrix are filled.

So, why is this?
Where is my mistake? What am I missing?

Any help is appreciated!

Kind regards
TobiasH




--
Sent from: https://weka.8497.n7.nabble.com/
_______________________________________________
Wekalist mailing list -- [hidden email]
Send posts to [hidden email]
To unsubscribe send an email to [hidden email]
To subscribe, unsubscribe, etc., visit https://list.waikato.ac.nz/postorius/lists/wekalist.list.waikato.ac.nz
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html
Re: ClassificationViaClustering - More Classes Than Clusters

Peter Reutemann
[...]

> The output that I don't understand is:
> -----------------------
> Clusters to classes mapping:
>   1. Cluster: '(-inf-12.495]' (1)
>   2. Cluster: '(16.05-20.25]' (3)
>
> Classes to clusters mapping:
>   1. Class ('(-inf-12.495]'): 1. Cluster
>   2. Class ('(12.495-16.05]'): no cluster
>   3. Class ('(16.05-20.25]'): 2. Cluster
>   4. Class ('(20.25-inf)'): no cluster

This bit is from a single model that was built on the full training
data. You omitted the following text from your post:
=== Classifier model (full training set) ===


> === Stratified cross-validation ===
> === Summary ===

[...]

> === Confusion Matrix ===
>
>    a   b   c   d   <-- classified as
>  603   0 134 542 |   a = '(-inf-12.495]'
>  515   0 179 577 |   b = '(12.495-16.05]'
>  416   0 276 589 |   c = '(16.05-20.25]'
>  438   0 221 625 |   d = '(20.25-inf)'

The confusion matrix is generated from aggregated results obtained
through cross-validation (most likely 10-fold, therefore from 10
different models). Each fold uses a different train/test split, which
will most likely change the clustering and therefore which classes the
clusters get mapped to. That is how more than two classes can show up
as predictions in the aggregated matrix.
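
To illustrate the aggregation effect, here is a small sketch in plain Python (not Weka code). The two folds, their cluster-to-class mappings, and the test instances are all hypothetical; the point is only that summing per-fold confusion matrices can fill more columns than any single two-cluster model can predict.

```python
def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


labels = ["a", "b", "c", "d"]

# Hypothetical folds: each has its own cluster -> class mapping (learned on
# that fold's training split) and test instances as (actual class, cluster id).
folds = [
    ({0: "a", 1: "c"}, [("a", 0), ("b", 0), ("c", 1), ("d", 1)]),
    ({0: "a", 1: "d"}, [("a", 0), ("b", 1), ("c", 0), ("d", 1)]),
]

total = [[0] * len(labels) for _ in labels]
for mapping, test in folds:
    actual = [cls for cls, _ in test]
    predicted = [mapping[cluster] for _, cluster in test]
    fold_matrix = confusion_matrix(actual, predicted, labels)
    for i in range(len(labels)):
        for j in range(len(labels)):
            total[i][j] += fold_matrix[i][j]

# Classes that were predicted at least once across all folds.
predicted_classes = [
    labels[j] for j in range(len(labels)) if any(row[j] for row in total)
]
print(predicted_classes)  # ['a', 'c', 'd']
```

Each fold on its own predicts only two classes, but the summed matrix has three non-empty columns, just like the one quoted above.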

Cheers, Peter
--
Peter Reutemann
Dept. of Computer Science
University of Waikato, NZ
+64 (7) 577-5304
http://www.cms.waikato.ac.nz/~fracpete/
http://www.data-mining.co.nz/