Re: Wekalist Digest, Vol 199, Issue 40

Prof. David MW Powers
Date: Sun, 22 Sep 2019 13:55:00 -0700 (MST)
From: mcbenly <[hidden email]>
Subject: [Wekalist] Kappa metric for multi-class classification?

I am having difficulty choosing the best performance measure for my multi-class
classification problem.
There are four classes in my dataset, and the data is imbalanced.

Personally I preferred using weighted f-measure and AUROC for binary
classification. But I guess I can't use AUROC for multi-class
classification. Not sure weighted f-measure alone would be good for
multi-class measurement. 

I read in a few research papers that, for a multi-class problem, one should use
micro- or macro-averaged F-measure, and micro-averaging if the data is imbalanced.

But as far as I understand micro f-measure averaging is same as
classification accuracy...

I was wondering if I could use "classification accuracy + Kappa Statistics"
as my *main performance measure*? Will this be right combination? 

OR any other suggestion you might have? 

Thanks, Ben

In those circumstances F1 is not a good choice. Chance-corrected kappa measures are more appropriate, and can be applied directly to multiclass data. You can also macroaverage, weighting each class by the bias towards that prediction (the proportion of the time that class label is predicted) - it is not appropriate to weight by the prevalence (the proportion of the time the real class occurs). Accuracy is also easily biased, and is misleading to the extent that bias doesn't match prevalence. If you have a per-class or per-instance cost you can use that, but otherwise a chance-corrected measure is best.

The Cohen Kappa included in Weka (a chance-corrected version of Accuracy) is a reasonable choice but not a good one: like F1, it behaves poorly when the prediction bias fails to match the prevalence for each class. I include a link to a paper on this below.
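As an illustration of the chance correction described above, here is a minimal sketch of Cohen's Kappa computed from a confusion matrix. The function name `cohen_kappa` and the `cm[real][predicted]` layout are assumptions made for this example, not Weka's API. It shows a classifier that always predicts the majority class scoring 90% accuracy while Kappa falls to zero.

```python
def cohen_kappa(cm):
    """Cohen's Kappa for a square confusion matrix, cm[real][predicted] (assumed layout)."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    # Observed agreement is plain accuracy.
    observed = sum(cm[i][i] for i in range(k)) / n
    # Prevalence: how often each real class occurs.
    prevalence = [sum(cm[i]) / n for i in range(k)]
    # Bias: how often each label is predicted.
    bias = [sum(cm[i][j] for i in range(k)) / n for j in range(k)]
    # Agreement expected by chance if predictions were independent of the truth.
    expected = sum(p * b for p, b in zip(prevalence, bias))
    return (observed - expected) / (1 - expected)


# Hypothetical data: 90 positives, 10 negatives, everything predicted positive.
always_majority = [[90, 0],
                   [10, 0]]
accuracy = (90 + 0) / 100
kappa = cohen_kappa(always_majority)
print(accuracy, kappa)
```

The sketch assumes chance agreement is below 1 (otherwise the denominator is zero).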

What is appropriate is the multiclass form of Kappa called Informedness, which is chance-corrected in the sense that it gives the probability of an informed decision (viz. not chance). Again I include links.

The binary form of this is Peirce (1884)'s I, Youden (1950)'s J, Flach (2003)'s deskewed WRAcc, and what is known in psychology as DeltaP'. It corresponds to the distance above the chance line in the ROC curve, viz. tpr-fpr, which is what is maximized when choosing the standard operating point in ROC. It macroaverages over predictions as described above to estimate the multiclass form of Informedness (the short ECAI and long JMLT papers show how the Bookmaker estimate recovers the underlying probability with which a Monte Carlo simulation makes an informed decision rather than guessing).
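The binary quantity tpr-fpr can be sketched in a few lines. The function name `informedness_binary` and the count arguments are hypothetical names chosen for this example.

```python
def informedness_binary(tp, fn, fp, tn):
    """Binary Informedness (Youden's J): true positive rate minus false positive rate."""
    tpr = tp / (tp + fn)  # proportion of real positives predicted positive
    fpr = fp / (fp + tn)  # proportion of real negatives predicted positive
    return tpr - fpr


# Hypothetical counts: a balanced problem with 90% of each class correct.
balanced = informedness_binary(tp=45, fn=5, fp=5, tn=45)

# Hypothetical counts: always predicting positive on 90/10 data.
# tpr and fpr are both 1, so the point sits on the ROC chance line.
guessing = informedness_binary(tp=90, fn=0, fp=10, tn=0)
print(balanced, guessing)
```

The sketch assumes both real classes occur at least once, so neither rate divides by zero.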

This is a hobbyhorse of mine… I originally modelled informedness in terms of gambling on your predictions (hence the multiclass measure is also known as Bookmaker, Bookmaker Informedness or Bookmaker Probability), and that makes it clear why you should weight classes by their bias - the appropriate weight across horses is how much you bet on each horse. I have written extensively on this, including providing Matlab scripts, an Excel calculator and a version of Weka that provides it as an alternate evaluation measure (in Explorer and Experimenter as well as Adaboost, which turns it into Adabook). I include a selection below (but e.g. exclude ones about visualizations, including the relation to ROC and AUC - there's also a paper about why you should never use F-score, and one that focuses on multiclass visualizations - both available on arXiv).
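The bias-weighted macroaverage described above might be sketched as follows: one-vs-rest Informedness per class, each weighted by how often that class is predicted. This is an illustrative reading of the description in this message, not the code from the Matlab scripts or the modified Weka; the name `bookmaker_informedness` and the `cm[real][predicted]` layout are assumptions.

```python
def bookmaker_informedness(cm):
    """Bias-weighted average of one-vs-rest Informedness.

    cm[real][predicted] is a square confusion matrix (assumed layout).
    Assumes every real class occurs at least once and there are at least two classes.
    """
    k = len(cm)
    n = sum(sum(row) for row in cm)
    total = 0.0
    for c in range(k):
        real_pos = sum(cm[c])                       # instances whose real class is c
        pred_pos = sum(cm[i][c] for i in range(k))  # instances predicted as c
        tp = cm[c][c]
        fp = pred_pos - tp
        tpr = tp / real_pos
        fpr = fp / (n - real_pos)
        bias = pred_pos / n                         # the "bet" placed on class c
        total += bias * (tpr - fpr)
    return total


# Hypothetical four-class imbalanced data (70/15/10/5), always predicting class 0.
always_first = [[70, 0, 0, 0],
                [15, 0, 0, 0],
                [10, 0, 0, 0],
                [5, 0, 0, 0]]
print(bookmaker_informedness(always_first))  # accuracy is 0.7, yet no informed decisions
```

Classes that are never predicted receive zero weight, which is the betting intuition: a horse you never bet on contributes nothing to your return.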

Informedness papers

2013 ICINCO Paper+Poster - Adabook & Multibook

2012 EACL Paper+Poster - The Problem with Kappa

2011 JMLT - Evaluation: from Precision, Recall and F-measure to ROC, Informedness, Markedness and Correlation

2008 ECAI Paper+Poster+Talk - Evaluation Evaluation

2003 ICCS Paper+Poster - Recall and Precision vs the Bookmaker (38)

1998 CoNLL Paper - The Present use of Statistics in evaluation of NLP parsers

You also mentioned liking AUROC. It is important to understand what this actually measures! 

ROC AUC gives the probability that a randomly chosen positive instance is ranked above a randomly chosen negative instance. It represents a balance between finding a specific operating point (Certainty = (Informedness+1)/2 is then the area under a three-point curve) and how much room there is for distributional variance (Consistency = AUC - Certainty, the area between the multipoint curve or convex hull and the three-point curve). This is discussed in my ROC ConCert paper, to which I've added a link.
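The identity Certainty = (Informedness+1)/2 can be checked numerically: the trapezoidal area under the three-point curve (0,0), (fpr,tpr), (1,1) works out to (tpr-fpr+1)/2. The sketch below uses hypothetical function names and made-up ROC points purely for illustration.

```python
def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (fpr, tpr) points sorted by fpr."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def certainty(tpr, fpr):
    """Certainty = (Informedness + 1) / 2, with Informedness = tpr - fpr."""
    return (tpr - fpr + 1) / 2


# Hypothetical operating point and a hypothetical multipoint ROC curve through it.
tpr, fpr = 0.9, 0.1
three_point = trapezoid_area([(0.0, 0.0), (fpr, tpr), (1.0, 1.0)])
multi_point = trapezoid_area([(0.0, 0.0), (0.05, 0.6), (0.1, 0.9), (0.4, 0.98), (1.0, 1.0)])

# Consistency is the extra area the full curve adds over the three-point curve.
consistency = multi_point - certainty(tpr, fpr)
print(three_point, certainty(tpr, fpr), consistency)
```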

2012 ROC ConCert

Prof. David M W Powers, Ph.D.
                                                                   mail: [hidden email]

Professor of Computer Science & Cognitive Science, TON2.10
South Australia Research Director,  ARC ITRH Digital Enhanced Living Hub

College of Science and Engineering                              (Phone: 08-8201 3663)
Flinders University, Tonsley, South Australia 5042       (Fax: +61-8-8201 3626)
GPO Box 2100 Adelaide SA 5001                       (Mobile/Viber: 0414-824-307)
