Unbalanced datasets: How to compare the Precision of a weka classifier against the Precision of a guess classifier


Marina Santini
Hi, 

I am working with unbalanced datasets, similar to the diabetes dataset (500 tested-negative and 268 tested-positive; total instances: 768).

I wish to measure how much the Precision on the minority class of a Naive Bayes classifier (standard parameters) improves with respect to the Precision of a random classifier on the same class.

Since the dataset is unbalanced, it does not make sense to use ZeroR, since ZeroR only guesses the majority class.

Since I thought it is NOT theoretically sound to compare an evaluation measure like Precision (TP/(TP+FP)) to the probability distribution of the individual classes (i.e. total instances of a class over the whole population), it was suggested to me that I use a random classifier, or better a weighted guess classifier, as described in these pages:
http://blog.revolutionanalytics.com/2016/03/classification-models.html
https://github.com/shaheeng/ClassificationModelEvaluation/blob/master/Baseline%20Metrics_Shaheen_article.pdf

Making the calculations (hopefully they are correct, but I cannot swear on them :-) ), it turns out that the Precision of the weighted guess classifier always corresponds to the probability distribution of the individual classes. That is, in the case of the diabetes dataset, the probability distribution of the minority class (i.e. "tested-positive") is 0.348, which is identical (according to my calculations) to the Precision on that class returned by the weighted guess classifier.

The Naive Bayes' precision on the "tested-positive" class is 0.678.

Can I safely (and soundly) say that NB improves on the precision of a random classifier by 0.33 points?

Do you know if there exist any implementations of a weighted guess classifier? Making all the calculations by hand is tedious...
It would be handy to have one available in the Weka workbench to use as a baseline.

Thanks in advance for your answer. 

Cheers, Marina
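
A minimal sketch of such a weighted guess baseline in plain Java (a hypothetical standalone class, not part of WEKA): both the true label and the guess are drawn from the class prior, so precision and recall on the positive class converge to that prior, consistent with the class proportion reported above.

import java.util.Random;

// Hypothetical weighted-guess baseline (not a WEKA class): guess each
// label with probability equal to the class prior and estimate precision
// and recall on the positive class by simulation.
public class WeightedGuessBaseline {
    public static void main(String[] args) {
        int nPos = 268, nNeg = 500;                      // diabetes counts from above
        double pPos = (double) nPos / (nPos + nNeg);     // class prior, ~0.349
        Random rnd = new Random(1);
        long tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < 1_000_000; i++) {
            boolean actualPos = rnd.nextDouble() < pPos; // draw a true label
            boolean guessPos  = rnd.nextDouble() < pPos; // weighted guess, same prior
            if (guessPos && actualPos) tp++;
            else if (guessPos) fp++;
            else if (actualPos) fn++;
        }
        System.out.printf("precision = %.3f, recall = %.3f, prior = %.3f%n",
                tp / (double) (tp + fp), tp / (double) (tp + fn), pPos);
    }
}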



 

Re: Unbalanced datasets: How to compare the Precision of a weka classifier against the Precision of a guess classifier

Eibe Frank-2
Administrator

> On 30/05/2017, at 3:36 AM, Marina Santini <[hidden email]> wrote:
>
> Can I safely (and soundly) say that NB improves on the precision of a random classifier by 0.33 points?

Yes, but just quoting precision is not sufficient. You also need to quote recall.

It might be more useful to compare to a random ranker instead (i.e., a classifier whose probability estimates are used to rank the instances according to their probability of belonging to the positive class). You could compare the area under the precision-recall curve for naive Bayes to that of a random ranker. The area for the random ranker is given by the proportion of positive instances in the test data.

> Do you know if there exist any implementations of a weighted guess classifier? Making all the calculations by hand is tedious...
> It would be handy to have one available in the Weka workbench to use as a baseline.

For the weighted guess classifier, precision and recall are both given by the proportion of positive instances in the test data, so no calculation is necessary if you know that proportion (see the article you linked to: http://blog.revolutionanalytics.com/2016/03/classification-models.html).

For normalised classification accuracy, use kappa (https://stats.stackexchange.com/questions/82162/cohens-kappa-in-plain-english), which WEKA does output.

Cheers,
Eibe
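
For readers who want to reproduce these numbers programmatically, here is a sketch using the WEKA Java API (the file path and the class value name "tested_positive" are assumptions; adjust them to your copy of the data):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PrcVsRandomBaseline {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetes.arff");               // assumed path
        data.setClassIndex(data.numAttributes() - 1);
        int pos = data.classAttribute().indexOfValue("tested_positive"); // assumed label

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

        // AUPRC of a random ranker = proportion of positive instances
        double baseline = data.attributeStats(data.classIndex()).nominalCounts[pos]
                / (double) data.numInstances();
        System.out.println("NB precision (pos):  " + eval.precision(pos));
        System.out.println("NB recall (pos):     " + eval.recall(pos));
        System.out.println("NB AUPRC (pos):      " + eval.areaUnderPRC(pos));
        System.out.println("Random-ranker AUPRC: " + baseline);
        System.out.println("Kappa:               " + eval.kappa());
    }
}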


Re: Unbalanced datasets: How to compare the Precision of a weka classifier against the Precision of a guess classifier

Marina Santini
Thanks Eibe!

I used the ranker only for attribute selection. 
I do not know how to use the ranker independently (as a standalone classifier) in the Weka Explorer. Could you please instruct me on that?

Big thanks for your exhaustive and quick replies.

Cheers, Marina

 


Re: Unbalanced datasets: How to compare the Precision of a weka classifier against the Precision of a guess classifier

Eibe Frank-2
Administrator
There isn't actually a random ranker in WEKA. The area under the precision-recall curve for the random ranker is given by the proportion of positive instances in the test data, so there is no need to actually run an experiment with a random ranker.

Cheers,
Eibe


Re: Unbalanced datasets: How to compare the Precision of a weka classifier against the Precision of a guess classifier

Marina Santini
Thanks!

Cheers, Marina

Re: Unbalanced datasets: How to compare the Precision of a weka classifier against the Precision of a guess classifier

Shu-Ju Tu
Hi

I just started learning machine learning with Weka and I found this
question interesting.

What exactly does "unbalanced dataset" mean?

Does a 50%-50% split make a 2-class dataset a "balanced" dataset?

In the diabetes 2-class data set
Positive class instances: 268 (35%)
Negative class instances: 500 (65%)

Marina, is your question related to increasing the predictive accuracy of the positive class?

Warm regards,
Shu-Ju



Re: Unbalanced datasets: How to compare the Precision of a weka classifier against the Precision of a guess classifier

Eibe Frank-2
Administrator

> On 31/05/2017, at 1:22 PM, Shu-Ju Tu <[hidden email]> wrote:
>
> What exactly does "unbalanced dataset" mean?
>
> Does a 50%-50% split make a 2-class dataset a "balanced" dataset?

Yes.

> In the diabetes 2-class data set
> Positive class instances: 268 (35%)
> Negative class instances: 500 (65%)

This dataset isn’t particularly imbalanced. The “sick” dataset in

  http://prdownloads.sourceforge.net/weka/datasets-UCI.jar

is quite imbalanced:

ZeroR predicts class value: negative

Time taken to build model: 0 seconds
Time taken to test model on training data: 0.04 seconds

=== Error on training data ===

Correctly Classified Instances        3541               93.8759 %
Incorrectly Classified Instances       231                6.1241 %
Kappa statistic                          0    
EER                                      0.5  
Quadratic Weighted Kappa                 0.0
Mean absolute error                      0.1152
Root mean squared error                  0.2398
Relative absolute error                100      %
Root relative squared error            100      %
Total Number of Instances             3772    


=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 1.000    1.000    0.939      1.000    0.968      0.000    0.500     0.939     negative
                 0.000    0.000    0.000      0.000    0.000      0.000    0.500     0.061     sick
Weighted Avg.    0.939    0.939    0.881      0.939    0.909      0.000    0.500     0.885    


=== Confusion Matrix ===

    a    b   <-- classified as
 3541    0 |    a = negative
  231    0 |    b = sick

Cheers,
Eibe
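
A sketch of reproducing this ZeroR run with the WEKA Java API (assuming sick.arff has been extracted from the jar above):

import weka.classifiers.Evaluation;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ZeroRBaseline {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("sick.arff");  // assumed local path
        data.setClassIndex(data.numAttributes() - 1);
        ZeroR zr = new ZeroR();
        zr.buildClassifier(data);
        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(zr, data);                   // error on training data
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toClassDetailsString());
        System.out.println(eval.toMatrixString());
    }
}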

Re: Unbalanced datasets: How to compare the Precision of a weka classifier against the Precision of a guess classifier

Marina Santini
Hi, 

A follow-up question; maybe you have already explained this or implied what I am going to ask, but I would like to find the words for a clear statement on the performance of a classifier with respect to a random baseline on the minority class.

These are my results:

The class proportions are the following:

negative = 21361/32542 = 0.656

positive = 11181/32542 = 0.344



=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances       22007               67.6265 %
Incorrectly Classified Instances     10535               32.3735 %
Kappa statistic                          0.1274

Total Number of Instances            32542     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.948    0.844    0.682      0.948    0.794      0.176    0.592     0.724     negative
                 0.156    0.052    0.613      0.156    0.249      0.176    0.592     0.460     positive
Weighted Avg.    0.676    0.572    0.659      0.676    0.607      0.176    0.592     0.633

=== Confusion Matrix ===

     a     b   <-- classified as
 20259  1102 |     a = negative
  9433  1748 |     b = positive


How shall I interpret the value of PRC Area for the positive class (i.e. 0.460)? What does this value tell me in plain English? Can I compare it directly to the class proportion (i.e. 0.344) and conclude that my classifier does 0.116 points better than a random baseline?

In my view, this classifier performs poorly on the positive class, just above random, despite a precision of 0.61.

How would you assess this performance on the positive class?

Thanks in advance.

Cheers, Marina





Re: Unbalanced datasets: How to compare the Precision of a weka classifier against the Precision of a guess classifier

Eibe Frank-2
Administrator
Yes, it looks pretty poor. This is consistent with the result you got for the ROC area. The ROC area is pretty close to 0.5, which is the ROC area you would get for a random ranker.

You should also check the precision-recall curve itself, by right-clicking on the classifier’s entry in the result list and selecting “Visualize threshold curve”.

Cheers,
Eibe
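
The curve can also be extracted outside the GUI with weka.classifiers.evaluation.ThresholdCurve. A sketch, reusing an Evaluation object filled by cross-validation as in the earlier sketch:

import weka.classifiers.Evaluation;
import weka.classifiers.evaluation.ThresholdCurve;
import weka.core.Instances;

// Sketch: print the precision-recall points for class index 'pos' from
// an Evaluation whose predictions were recorded during cross-validation.
static void printPrCurve(Evaluation eval, int pos) {
    Instances curve = new ThresholdCurve().getCurve(eval.predictions(), pos);
    int p = curve.attribute("Precision").index();
    int r = curve.attribute("Recall").index();
    for (int i = 0; i < curve.numInstances(); i++) {
        System.out.printf("recall = %.3f, precision = %.3f%n",
                curve.instance(i).value(r), curve.instance(i).value(p));
    }
    System.out.println("PRC area: " + ThresholdCurve.getPRCArea(curve));
}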


Re: Unbalanced datasets: How to compare the Precision of a weka classifier against the Precision of a guess classifier

Marina Santini
Hi, 

here is the PR curve.


[attached image: precision-recall curve for the positive class]


Is there any rule of thumb to assess the goodness of this curve? 

If I understand correctly, the baseline of this curve is the precision of a random classifier on the same class, i.e. 0.34 (which corresponds to the class probability distribution).
The area under this curve is 0.46. The PRC area of a perfect classifier is 1.0.
Therefore we have to assess this curve with respect to the interval [0.34, 1.0].

Is there any scale that we can use to interpret this interval? 

Intuitively, I would say that this curve indicates a performance just above random and tells us that the classifier is not doing a good job on that class. But I am not sure whether I am being too pessimistic in this interpretation.

What would be a fair value? Maybe 0.70?

If you could suggest any references indicating how to interpret the values of a PRC area, I would be very grateful.

Thanks a lot for your help.

Cheers, Marina
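
One way to place the 0.46 inside the [0.34, 1.0] interval is to normalise it against the random baseline, by analogy with how kappa normalises accuracy. This is an illustrative convention only, not a standard WEKA output:

// Illustrative normalisation: 0 = random ranker, 1 = perfect ranker.
double auprc = 0.460, baseline = 0.344;
double normalised = (auprc - baseline) / (1.0 - baseline);
System.out.printf("normalised AUPRC = %.3f%n", normalised);  // ~0.177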



Re: Unbalanced datasets: How to compare the Precision of a weka classifier against the Precision of a guess classifier

Eibe Frank-2
Administrator
The curve actually shows that the classifier performs much better than a random ranker for very small values of recall! That’s why it can be important to look at the actual curve: considering only the area under the curve, we wouldn’t have seen this.

According to the precision at K (P@K) metric, which WEKA unfortunately doesn’t output, your classifier would be a good one for values of K that are not excessively large.

Cheers,
Eibe
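
Since WEKA does not output P@K directly, here is a sketch of computing it from an Evaluation's stored predictions (assuming a nominal two-class problem, with pos the index of the positive class):

import java.util.ArrayList;
import java.util.Comparator;
import weka.classifiers.Evaluation;
import weka.classifiers.evaluation.NominalPrediction;
import weka.classifiers.evaluation.Prediction;

// Sketch: rank instances by predicted probability of the positive class
// and return the fraction of true positives among the top k.
static double precisionAtK(Evaluation eval, int pos, int k) {
    ArrayList<Prediction> preds = new ArrayList<>(eval.predictions());
    preds.sort(Comparator.comparingDouble(
            (Prediction pr) -> ((NominalPrediction) pr).distribution()[pos]).reversed());
    int hits = 0;
    for (int i = 0; i < k && i < preds.size(); i++) {
        if ((int) preds.get(i).actual() == pos) hits++;
    }
    return hits / (double) k;
}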


Re: Unbalanced datasets: How to compare the Precision of a weka classifier against the Precision of a guess classifier

Marina Santini
Thanks a lot!

Cheers, Marina

_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
List info and subscription status: https://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html