Date: Sun, 22 Sep 2019 13:55:00 -0700 (MST)

From: mcbenly <[hidden email]>

Subject: [Wekalist] Kappa metric for multi-class classification?

Hi,

I am having difficulty choosing the best performance measure for my multi-class classification problem.

There are four classes in my dataset, and the data is imbalanced.

Personally, I preferred using weighted F-measure and AUROC for binary classification. But I guess I can't use AUROC for multi-class classification, and I am not sure weighted F-measure alone would be good for multi-class measurement.

I read in a few research papers that for multi-class problems one should use micro- or macro-averaged F-measure, and micro-averaging if the data is imbalanced.

But as far as I understand, micro-averaged F-measure is the same as classification accuracy...
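
The equivalence mentioned here can be checked numerically for single-label multi-class data. The following is a minimal sketch, assuming a confusion matrix with rows as real classes and columns as predicted classes; the function name and the counts are made up for illustration.

```python
def micro_f1_and_accuracy(cm):
    """cm[i][j] = number of instances of real class i predicted as class j."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    tp = sum(cm[c][c] for c in range(k))
    # Pooled over classes: every off-diagonal cell is a false positive
    # for its column class and a false negative for its row class.
    fp = sum(cm[i][c] for c in range(k) for i in range(k) if i != c)
    fn = sum(cm[c][j] for c in range(k) for j in range(k) if j != c)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    micro_f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp / n
    return micro_f1, accuracy


# Hypothetical four-class, imbalanced confusion matrix (made-up counts).
cm = [
    [60, 5, 3, 2],
    [6, 10, 2, 2],
    [3, 1, 4, 0],
    [1, 1, 0, 0],
]
micro_f1, acc = micro_f1_and_accuracy(cm)
print(micro_f1, acc)
```

Because each error is counted once as a false positive and once as a false negative, pooled precision and recall both reduce to correct/total, so the two printed values agree up to rounding.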

I was wondering if I could use "classification accuracy + Kappa statistic" as my *main performance measure*? Would this be the right combination?

Or any other suggestion you might have?

Thanks, Ben

In those circumstances F1 is not a good choice; chance-corrected kappa measures are more appropriate and can be applied directly to multiclass data. You can also macroaverage, weighting each class by the bias towards that prediction (the proportion of the time that class label is predicted); it is not appropriate to weight by the prevalence (the proportion of the time the real class occurs). Accuracy is also easily biased and is misleading to the extent that bias doesn't match prevalence. To the extent you have a per-class or per-instance cost you can use that, but otherwise a chance-corrected measure is best.
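
To make the terms concrete, here is a minimal sketch of how prevalence, bias and accuracy can be read off a confusion matrix. It assumes rows are real classes and columns are predicted classes; the function name and the counts are hypothetical.

```python
def prevalence_bias_accuracy(cm):
    """cm[i][j] = number of instances of real class i predicted as class j."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    # Prevalence: proportion of the time each real class occurs (row sums).
    prevalence = [sum(cm[c]) / n for c in range(k)]
    # Bias: proportion of the time each label is predicted (column sums).
    bias = [sum(cm[i][c] for i in range(k)) / n for c in range(k)]
    accuracy = sum(cm[c][c] for c in range(k)) / n
    return prevalence, bias, accuracy


# Hypothetical classifier that always predicts the majority class (class 0)
# on a four-class imbalanced dataset of 100 instances.
cm = [
    [70, 0, 0, 0],
    [15, 0, 0, 0],
    [10, 0, 0, 0],
    [5, 0, 0, 0],
]
prevalence, bias, accuracy = prevalence_bias_accuracy(cm)
print("prevalence:", prevalence)
print("bias:      ", bias)
print("accuracy:  ", accuracy)
```

In this made-up case accuracy equals the majority-class prevalence even though the classifier uses no information, and the bias vector is far from the prevalence vector.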

The Cohen Kappa included in Weka (a chance-corrected version of Accuracy) is a reasonable choice but not a good one: like F1, it is unreliable when prediction bias fails to match prevalence for each class. I include a link to a paper on this below.
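
For reference, Cohen's Kappa is (observed agreement - expected agreement) / (1 - expected agreement), where the expected agreement is the sum over classes of prevalence times bias. The sketch below is an illustration written from that standard definition, not Weka's own code; the function name and the counts are hypothetical.

```python
def cohen_kappa(cm):
    """cm[i][j] = number of instances of real class i predicted as class j."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    observed = sum(cm[c][c] for c in range(k)) / n  # plain accuracy
    prevalence = [sum(cm[c]) / n for c in range(k)]
    bias = [sum(cm[i][c] for i in range(k)) / n for c in range(k)]
    # Agreement expected by chance given these marginals.
    expected = sum(p * b for p, b in zip(prevalence, bias))
    if expected == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical always-predict-majority classifier: accuracy 0.7, kappa 0.
majority = [
    [70, 0, 0, 0],
    [15, 0, 0, 0],
    [10, 0, 0, 0],
    [5, 0, 0, 0],
]
# Hypothetical balanced binary classifier: accuracy 0.9.
balanced = [
    [45, 5],
    [5, 45],
]
print(cohen_kappa(majority))
print(cohen_kappa(balanced))
```

The chance term depends on both marginals (prevalence and bias), which is where a mismatch between the two enters the statistic.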

What is appropriate is the multiclass form of Kappa called Informedness, which is chance-corrected in the sense that it gives the probability of an informed decision (viz. not chance). Again I include links.

The binary form of this is Peirce (1884)'s I, Youden (1950)'s J, Flach (2003)'s deskewed WRAcc, and what is known in psychology as DeltaP'. It corresponds to the distance above the chance line in the ROC curve, viz. tpr-fpr, which is what is maximized when choosing the standard operating point in ROC. It macroaverages over predictions as described above to estimate the multiclass form of Informedness (and the short ECAI and long JMLT papers show how the Bookmaker estimate recovers the underlying probability with which a Monte Carlo simulation makes an informed decision rather than guessing).
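
A minimal sketch of this calculation, as I read the description above: compute tpr-fpr for each class in a one-vs-rest fashion, then average the per-class values weighted by bias. This is not taken from the Matlab scripts or the modified Weka; the function names, the zero-denominator handling and the counts are assumptions made for illustration.

```python
def class_informedness(cm, c):
    """One-vs-rest tpr - fpr for class c; cm[i][j] = real i predicted as j."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    real_c = sum(cm[c])                         # instances really in class c
    pred_c = sum(cm[i][c] for i in range(k))    # instances predicted as c
    tp = cm[c][c]
    fp = pred_c - tp
    # Assumption: treat a rate as 0.0 when its denominator is empty.
    tpr = tp / real_c if real_c else 0.0
    fpr = fp / (n - real_c) if n - real_c else 0.0
    return tpr - fpr


def bookmaker_informedness(cm):
    """Bias-weighted average of the per-class one-vs-rest informedness."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    total = 0.0
    for c in range(k):
        bias_c = sum(cm[i][c] for i in range(k)) / n
        total += bias_c * class_informedness(cm, c)
    return total


# Hypothetical imbalanced binary case: tpr = 0.8, fpr = 30/90.
binary = [
    [8, 2],
    [30, 60],
]
# Hypothetical always-predict-majority classifier on four classes.
majority = [
    [70, 0, 0, 0],
    [15, 0, 0, 0],
    [10, 0, 0, 0],
    [5, 0, 0, 0],
]
print(bookmaker_informedness(binary))
print(bookmaker_informedness(majority))
```

In the binary case both classes yield the same tpr-fpr, so the weighted average reduces to the binary measure; the always-majority classifier scores zero because its only predicted class has tpr = fpr = 1.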

This is a hobbyhorse of mine… I originally modelled informedness in terms of gambling on your predictions (hence the multiclass measure is also known as Bookmaker, Bookmaker Informedness or Bookmaker Probability), and that makes it clear why you should weight classes by their bias: the appropriate weight across horses is how much you bet on each horse. I have written extensively on this, including providing Matlab scripts, an Excel calculator and a version of Weka that provides it as an alternate evaluation measure (in Explorer and Experimenter as well as AdaBoost, which turns it into Adabook). I include a selection below (but exclude, e.g., ones about visualizations, including the relation to ROC and AUC; there's also a paper about why you should never use F-score, and one that focuses on multiclass visualizations, both available on arXiv).

Informedness papers

2013 ICINCO Paper+Poster - Adabook & Multibook

https://dspace.flinders.edu.au/jspui/bitstream/2328/27163/2/Powers%20Evaluation%20poster.pdf

2012 EACL Paper+Poster - The Problem with Kappa

https://dspace.flinders.edu.au/jspui/bitstream/2328/27160/2/Powers%20Problem%20poster.pdf

2011 JMLT - Evaluation: from Precision, Recall and F-measure to ROC, Informedness, Markedness and Correlation

http://dspace.flinders.edu.au/jspui/bitstream/2328/27165/1/Powers%20Evaluation.pdf

2008 ECAI Paper+Poster+Talk - Evaluation Evaluation

http://dspace2.flinders.edu.au/xmlui/bitstream/handle/2328/27163/Powers%20Evaluation%20poster%20TALKY.ppt

2003 ICCS Paper+Poster - Recall and Precision vs the Bookmaker (38)

1998 CoNLL Paper - The Present use of Statistics in evaluation of NLP parsers

You also mentioned liking AUROC. It is important to understand what this actually measures!

ROC AUC gives the probability that a randomly chosen positive instance is ranked above a randomly chosen negative one. It represents a balance between finding a specific operating point (Certainty = (Informedness+1)/2 is then the area under the three-point curve through that point) and how much room there is for distributional variance (Consistency = AUC - Certainty, the area between the multipoint curve or convex hull and the three-point curve), as discussed in my ROC ConCert paper; I've added a link to this.

2012 ROC ConCert
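
The AUC, Certainty and Consistency relationship described above can be sketched as follows. This is only an illustration under stated assumptions, not code from the ROC ConCert paper: the scores and threshold are made up, the function names are hypothetical, and AUC is computed directly as the pairwise ranking probability (ties counted as one half).

```python
def pairwise_auc(pos_scores, neg_scores):
    """Probability that a positive is scored above a negative."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def operating_point(pos_scores, neg_scores, threshold):
    """tpr and fpr when predicting positive for scores >= threshold."""
    tpr = sum(s >= threshold for s in pos_scores) / len(pos_scores)
    fpr = sum(s >= threshold for s in neg_scores) / len(neg_scores)
    return tpr, fpr


def three_point_area(tpr, fpr):
    """Trapezoid area under the curve (0,0) -> (fpr,tpr) -> (1,1)."""
    return 0.5 * fpr * tpr + 0.5 * (1.0 - fpr) * (tpr + 1.0)


# Made-up classifier scores for three positives and three negatives.
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3, 0.2]

auc = pairwise_auc(pos, neg)
tpr, fpr = operating_point(pos, neg, threshold=0.75)
informedness = tpr - fpr
certainty = three_point_area(tpr, fpr)   # equals (informedness + 1) / 2
consistency = auc - certainty
print(auc, informedness, certainty, consistency)
```

Expanding the trapezoid sum gives (tpr - fpr + 1)/2, which is why the three-point area and (Informedness+1)/2 coincide for any single operating point.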