Unbalanced dataset and multiclass cost matrix

Unbalanced dataset and multiclass cost matrix

Marina Santini
Hi,

I have a real-world dataset. The dataset is unbalanced. Predictors are nominal and quite weak. I am using 9 predictors.
The main task is to predict the minority class.
I tried several approaches (ClassBalancer, SpreadSubsample, Resample, bagging, stacking), but none of them was satisfactory.

Now I am focusing on cost-sensitive classification.
I need some advice on how to adjust and calibrate the cost matrix for a multiclass problem. I have not found any documentation on that.
Could you please help me understand how to calculate the weights and move them around when using cost-sensitive classification for a multiclass problem?

My dataset looks like this:
[image: class distribution of the dataset]

My goal is to get the first class (i.e. the blue block) and the last class (i.e. the cyan block) as correct as possible.
The middle class is to be used as a “safety net” that should “host” misclassified instances belonging to the minority class (i.e. the cyan block).

The baseline performance of NB-k without a cost matrix is the following:

[image: baseline NB-k performance and confusion matrix]

My goal is to produce as few FPs and FNs as possible for the a (SHORT) class and the c (LONG) class. I use cost-sensitive classification with the NB-k classifier, but I cannot guess the “ideal” weights given my goal. I tried out many combinations of the weights, for example:

[image: examples of cost matrices tried]

I am not happy with the results. Is there a rule of thumb, or any proper approach, for adjusting a cost matrix to solve my classification problem?

Thanks in advance for your answer.

 
Best regards

Marina

_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
List info and subscription status: https://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html
Re: Unbalanced dataset and multiclass cost matrix

Marina Santini
Thanks Davide. 

DT does not perform well on this dataset; it has the same performance as the ZeroR classifier. I tried out SMOTE and NB-k, and the performance improves a little bit: recall is OK, but precision is still too low.

But thanks for suggesting SMOTE, I had not tried it out before. 

Cheers, Marina

On 11 June 2017 at 16:59, Davide Barbieri <[hidden email]> wrote:
Hello Marina
Have you tried FilteredClassifier with a MultiFilter? The filters could be SMOTE (oversampling) and SpreadSubsample (undersampling).
That way you can balance your training set and still test your classifier on the imbalanced test set.
Also, I would try J48.
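The balancing idea above can be sketched outside Weka. This is a toy illustration with made-up numeric points, using a simplified SMOTE-style interpolation plus random undersampling; it is not Weka's actual filter implementation, and interpolation of this kind assumes numeric attributes:

```python
import random

def smote_like(minority, n_new):
    # Simplified SMOTE: create synthetic points by interpolating
    # between random pairs of minority-class instances.
    synthetic = []
    for _ in range(n_new):
        a, b = random.sample(minority, 2)
        t = random.random()
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic

def undersample(majority, n_keep):
    # Random undersampling of the majority class (SpreadSubsample-like).
    return random.sample(majority, n_keep)

random.seed(0)
minority = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4)]           # made-up data
majority = [(float(i), float(i % 3)) for i in range(20)]  # made-up data

# Oversample the minority class and undersample the majority class,
# then train on the (roughly) balanced result.
balanced = minority + smote_like(minority, 7) + undersample(majority, 10)
print(len(balanced))  # -> 20
```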

Davide Barbieri

Sent from my iPhone


Re: Unbalanced dataset and multiclass cost matrix

Alexander Osherenko
Marina,

you can install the Auto-WEKA package, which will run different classifiers and identify the one that produces the best results. According to http://www.cs.ubc.ca/labs/beta/Projects/autoweka/papers/autoweka.pdf, Auto-WEKA covers all classification approaches implemented in WEKA, spanning 2 ensemble methods, 10 meta-methods, and 27 base classifiers. You can set up different measures to select the best classifier, for example the one with the highest F-measure or the highest weighted F-measure.

However, I doubt it is a perfect solution for classifying unbalanced data, since it uses standard recall and standard precision. I would rather look at the data from another perspective and use your own heuristic measure to identify the best classifier, for example one based on the distance between wrong and correct outcomes in the confusion matrix, which would be more adequate for your unbalanced-data problem.
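One possible version of such a heuristic (my own toy sketch, not a built-in WEKA measure): score a confusion matrix by how far each misclassification lands from the true class in the SHORT/MEDIUM/LONG ordering, so that SHORT-vs-LONG confusions are penalized most:

```python
def distance_weighted_error(conf):
    # conf[i][j] = count of instances of true class i predicted as class j.
    # The penalty grows with |i - j|, so confusing the two extreme classes
    # (distance 2) costs twice as much as confusing adjacent classes.
    n = len(conf)
    total = sum(sum(row) for row in conf)
    penalty = sum(conf[i][j] * abs(i - j) for i in range(n) for j in range(n))
    return penalty / total

# made-up confusion matrix for classes (SHORT, MEDIUM, LONG)
conf = [[50, 10,  2],
        [ 8, 100, 12],
        [ 1, 15,  30]]
print(round(distance_weighted_error(conf), 3))  # -> 0.224
```

A lower score is better; a classifier could then be selected by this score instead of plain precision/recall.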

Best, Alexander


Re: Unbalanced dataset and multiclass cost matrix

Marina Santini
Big thanks, Alexander. I will give it a try!

Cheers, Marina


Re: Unbalanced dataset and multiclass cost matrix

Eibe Frank-2
Administrator
In reply to this post by Marina Santini
Have you tried RandomForest? It usually gives good performance without much parameter tuning.  You could also try wrapping RandomForest into OrdinalClassClassifier because your class values are ordered. This can sometimes also help a little.

You should assign maximum cost to errors where LONG and SHORT are confused. This is not the case in the confusion matrix you sent. Something like

0 1 2
1 0 1
2 1 0

would be more standard.

Note also that switching the CostSensitiveClassifier to “minimizeExpectedCost” normally works a little better and, when you have more than two classes, it is a more principled approach than the CostSensitiveClassifier default of reweighting instances.
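The idea behind “minimizeExpectedCost” can be sketched outside Weka: given a cost matrix like the one above and the base classifier's probability estimates, predict the class with the lowest expected cost. The probability estimates here are made up for illustration:

```python
def min_expected_cost(probs, cost):
    # Expected cost of predicting class j is sum_i P(class i) * cost[i][j].
    # Rows of cost = true class, columns = predicted class (the matrix
    # above is symmetric, so the convention does not change the result).
    n = len(probs)
    expected = [sum(probs[i] * cost[i][j] for i in range(n)) for j in range(n)]
    return min(range(n), key=lambda j: expected[j]), expected

# classes ordered (SHORT, MEDIUM, LONG); maximum cost on SHORT/LONG confusions
cost = [[0, 1, 2],
        [1, 0, 1],
        [2, 1, 0]]

probs = [0.6, 0.1, 0.3]  # made-up probability estimates from the base classifier
pred, expected = min_expected_cost(probs, cost)
print(pred, [round(e, 2) for e in expected])  # -> 0 [0.7, 0.9, 1.3]
```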

You may also want to take a look at the “Cost/Benefit analysis” tool in WEKA. One way to access it is to right-click on the appropriate entry in the “Result list” of the Classifier panel. You have to choose the class you are interested in (SHORT in your case) and the other classes are merged into one. This simplifies the cost matrix. It may allow you to get a feeling for what cost values are appropriate and whether it’s possible to achieve reasonable discrimination for the SHORT class at all.

Cheers,
Eibe


Re: Unbalanced dataset and multiclass cost matrix

Marina Santini
Thanks Eibe.

Where is OrdinalClassClassifier in the Explorer? I could not find it.

It does not appear in the Tool Manager, either.

Cheers, Marina



Re: Unbalanced dataset and multiclass cost matrix

Eibe Frank-2
Administrator
It's in a separate package of the same name.

Cheers,
Eibe


Re: Unbalanced dataset and multiclass cost matrix

Alexander Osherenko
In reply to this post by Marina Santini
Marina, you may run into another problem once you find a classification approach that classifies your data perfectly: it can work suboptimally on other data (something like classifier overfitting). I had this problem when comparing classification results on emotion data using NaiveBayes and SMO: NaiveBayes works very well on sparse data, but worse in other cases.

Best, Alexander


Re: Unbalanced dataset and multiclass cost matrix

Michael Hall
In reply to this post by Marina Santini

> On Jun 11, 2017, at 8:47 AM, Marina Santini <[hidden email]> wrote:
>
>
> Now I am focusing on cost-sensitive classification,

This is not a normal use of a cost matrix, and it probably doesn’t improve overall classification performance, does it?
My understanding is that a cost matrix is normally used when there is a different degree of concern for different classification errors.
Say “customer’s hair turns orange” is considered very bad to get wrong, so you make that error more costly.
When I tried it, it did seem to improve classification for the higher-cost entries, but it didn’t seem to improve overall performance: the error rate for the other outcomes suffered.
If I remember correctly, the classifier ended up with about the same overall performance as it did without the cost matrix.
Are there circumstances where this will improve overall performance?

Mike Hall


Re: Unbalanced dataset and multiclass cost matrix

Eibe Frank-2
Administrator
No, you are right, you are unlikely to get an improvement overall, at least with the minimum expected cost approach. The default cost matrix, assigning the same cost to every type of error, should be best in terms of overall misclassification error, unless the classifier's probability estimates are inaccurate.

Cheers,
Eibe
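The minimum expected cost rule mentioned above can be sketched in a few lines. This is a plain-Python illustration with made-up probabilities and costs; WEKA's CostSensitiveClassifier applies this rule when its minimizeExpectedCost option is enabled:

```python
# Minimum-expected-cost decision rule (illustrative sketch).
# cost[i][j] = cost of predicting class j when the true class is i.
def min_expected_cost_class(probs, cost):
    n = len(probs)
    # Expected cost of predicting class j, averaged over possible true classes i.
    expected = [sum(probs[i] * cost[i][j] for i in range(n)) for j in range(n)]
    return min(range(n), key=lambda j: expected[j])

# With uniform off-diagonal costs, this reproduces the plain
# most-probable-class prediction ...
uniform = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
probs = [0.5, 0.3, 0.2]
print(min_expected_cost_class(probs, uniform))  # 0 (the most probable class)

# ... while raising the cost of misclassifying class 2 flips the decision,
# even though class 2 has the lowest probability.
skewed = [[0, 1, 1], [1, 0, 1], [5, 5, 0]]
print(min_expected_cost_class(probs, skewed))   # 2
```

This also illustrates Eibe's point: with equal error costs the rule coincides with the default classifier, so any gain from a non-uniform matrix is a shift of errors between classes, not a reduction of errors overall.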

> On 12/06/2017, at 8:42 PM, Michael Hall <[hidden email]> wrote:
>
> [...]

Re: Unbalanced dataset and multiclass cost matrix

Marina Santini
Thanks, everybody, for the insightful answers. 

I am trying out many different solutions because the aim is to use an ML model in a real-world situation. 
I tried binary classification, but my impression is that it can be even harder. 

With 3 classes, I get these results with simple NB-k. 

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances       57727               52.699  %
Incorrectly Classified Instances     51814               47.301  %
Kappa statistic                          0.1662
Mean absolute error                      0.3679
Root mean squared error                  0.4314
Relative absolute error                 93.7958 %
Root relative squared error             97.4222 %
Total Number of Instances           109541     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0,701    0,452    0,530      0,701    0,603      0,247    0,663     0,555     SHORT
                 0,468    0,365    0,534      0,468    0,499      0,104    0,573     0,532     MEDIUM
                 0,106    0,021    0,378      0,106    0,166      0,154    0,733     0,264     LONG
Weighted Avg.    0,527    0,365    0,516      0,527    0,507      0,170    0,628     0,513     

=== Confusion Matrix ===

     a     b     c   <-- classified as
 32283 13470   330 |     a = SHORT
 25846 24204  1708 |     b = MEDIUM
  2834  7626  1240 |     c = LONG

My aim is to get the ac and ca cells as low as possible. Basically, I want the classifier to get the SHORT and LONG classes as correct as possible. 

In the confusion matrix above, if I could get the ca cell (currently 2834) down to around 300, I would start being happy :-)  That's why I was trying to play with the weights (i.e. the costs). 
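As a rough starting point (a common rule of thumb, not an official WEKA recipe), misclassification costs are often set roughly inversely proportional to class frequency, and then tuned from there. Using the class totals implied by the confusion matrix above:

```python
# Rule-of-thumb cost weights: inverse class frequency, normalised so the
# most frequent class has weight 1. Illustrative only; tune on held-out data.
counts = {"SHORT":  32283 + 13470 + 330,    # 46083
          "MEDIUM": 25846 + 24204 + 1708,   # 51758
          "LONG":   2834 + 7626 + 1240}     # 11700

max_count = max(counts.values())
weights = {c: round(max_count / n, 2) for c, n in counts.items()}
print(weights)  # {'SHORT': 1.12, 'MEDIUM': 1.0, 'LONG': 4.42}
```

A cost matrix built from these weights (e.g. errors on a true LONG instance costing about 4.4 times errors on a true MEDIUM one) mimics a balanced class distribution; it is only a starting point for the kind of manual tuning discussed in this thread.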

Cheers, Marina


On 12 June 2017 at 12:16, Eibe Frank <[hidden email]> wrote:
[...]

Re: Unbalanced dataset and multiclass cost matrix

Davide Barbieri
Hello Marina

sorry, but the problem is not so clear to me.
When classes are imbalanced, it is always difficult to decide which is the best measure for assessing classification performance, even in a basic binary problem.
For example, it would be very easy to have a perfect TPR = 100% if you declare every instance to be positive (e.g. the minority class).
It is a so-called "cry wolf" situation. But you would have TNR = 0, which is usually not acceptable. TNR = 100% with TPR = 0 is just as easy: declare all instances negative.
It is not obvious what the best trade-off is, but you will always reduce sensitivity to increase specificity and vice versa.
One way to support the decision about the best trade-off is to represent the classifier in ROC space and then assess it with the AUC, or with the ROC convex hull (ROCCH). Similarly, you can use the Youden index J.
These measures are appropriate when equal importance is given to sensitivity and specificity.
In some cases, e.g. in the medical domain, that is not acceptable.
What is the trade-off you are looking for?

PS: I hope I have been able to highlight the problem. If not, sorry for the long post.
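For concreteness (my own illustration, not from the thread): the Youden index is J = sensitivity + specificity - 1, ranging from 0 (chance level) to 1 (perfect). Computed from a 2x2 confusion matrix:

```python
# Youden's J statistic from a binary confusion matrix.
# J = TPR + TNR - 1; 0 means chance-level, 1 means a perfect classifier.
def youden_j(tp, fn, fp, tn):
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity + specificity - 1

# Using the NORMAL/LONG confusion matrix posted later in this thread,
# with LONG treated as the positive class:
tp, fn = 1163, 10537    # LONG row
fp, tn = 2002, 95839    # NORMAL row
print(round(youden_j(tp, fn, fp, tn), 3))  # 0.079
```

A J this close to zero says the classifier is barely better than chance at separating LONG from NORMAL, despite the 88% accuracy.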

2017-06-12 12:27 GMT+02:00 Marina Santini <[hidden email]>:
[...]

--
Davide Barbieri

http://docente.unife.it/davide.barbieri/

Universita' di Ferrara - http://www.unife.it/


Re: Unbalanced dataset and multiclass cost matrix

Marina Santini
Hi all, 

I am pondering how I can frame the classification problem properly. As I said, it is a real-world problem and I cannot disclose the details. 

I will try to describe the problem using an example about "hospitalization".
Say that: 
The purpose of the classification is to predict people prone to long hospitalization. The focus is on LONG HOSPITALIZATION (the minority class).
Hospitals want to know in advance who is prone to long hospitalization because they want to be ready with the right number of beds, nurses, and so on. All this costs money, and they want to have an idea of the amount they will have to invest in the future, perhaps in alternative solutions such as eCare at home and the like (well... I hope this makes sense... :-) )

A dataset is available containing records about people who were hospitalized in previous years. The number of days of hospitalization has been recorded.

10 nominal (quite weak) predictors are available, such as gender, age, employment, etc. 

The boundary between a normal hospitalization and a long one is 3 months. 

So first I tried a binary classification: <=90 days vs >90 days. 

The distribution is the following: 

[inline image: NORMAL vs LONG class distribution]


On this dataset, DT performs at chance level and RandomForest performs poorly, while NB-k produces the best results:

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances       97002               88.5531 %
Incorrectly Classified Instances     12539               11.4469 %
Kappa statistic                          0.1163
Mean absolute error                      0.1737
Root mean squared error                  0.2995
Relative absolute error                 91.0125 %
Root relative squared error             96.9781 %
Total Number of Instances           109541     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0,980    0,901    0,901      0,980    0,939      0,146    0,732     0,950     NORMAL
                 0,099    0,020    0,367      0,099    0,156      0,146    0,732     0,260     LONG
Weighted Avg.    0,886    0,807    0,844      0,886    0,855      0,146    0,732     0,876     

=== Confusion Matrix ===

     a     b   <-- classified as
 95839  2002 |     a = NORMAL
 10537  1163 |     b = LONG


The LONG class has low precision and very poor recall.  
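Those per-class figures can be read straight off the confusion matrix; here is a quick plain-Python sanity check (my own addition) of the LONG row above:

```python
# Precision and recall for the LONG class, read from the confusion matrix:
#          a      b    <-- classified as
#      95839   2002  |  a = NORMAL
#      10537   1163  |  b = LONG
tp = 1163    # LONG correctly classified as LONG
fn = 10537   # LONG misclassified as NORMAL
fp = 2002    # NORMAL misclassified as LONG

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(round(precision, 3), round(recall, 3))  # 0.367 0.099
```

The numbers match the WEKA output: barely one LONG case in ten is caught.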

I tried binarizing the predictors, with a marginal improvement on the minority class: 

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0,921    0,714    0,915      0,921    0,918      0,212    0,729     0,950     NORMAL
                 0,286    0,079    0,302      0,286    0,294      0,212    0,729     0,248     LONG
Weighted Avg.    0,853    0,646    0,850      0,853    0,851      0,212    0,729     0,875     

=== Confusion Matrix ===

     a     b   <-- classified as
 90096  7745 |     a = NORMAL
  8355  3345 |     b = LONG

Then I tried all your suggestions and many more. For instance SMOTE gives the following:


                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0,939    0,751    0,913      0,939    0,926      0,214    0,732     0,950     NORMAL
                 0,249    0,061    0,330      0,249    0,284      0,214    0,732     0,256     LONG
Weighted Avg.    0,866    0,677    0,850      0,866    0,857      0,214    0,732     0,876     

=== Confusion Matrix ===

     a     b   <-- classified as
 91904  5937 |     a = NORMAL
  8782  2918 |     b = LONG


Then, feeling frustrated, I thought I could create an intermediate class to collect all the LONG hospitalizations that tend to be misclassified in the binary setting. 
The assumption is that it might be more useful to know clearly whether somebody is heading for a short or a long hospitalization, leaving the MEDIUM class as a grey area where the healthcare staff may have to carry out further investigation. 

So I came up with the 3 classes of my previous email. This explains (I hope) why I would like to decrease the ca cell (2834) and the ac cell (330) of the confusion matrix I reported there: hospitals allocate different budgets for short and long hospitalizations, so they would like accurate predictions on these classes, or at least on the LONG hospitalizations. 

I hope this fictitious example makes sense to you.

Cheers, Marina




On 12 June 2017 at 12:42, Davide Barbieri <[hidden email]> wrote:
[...]

Re: Unbalanced dataset and multiclass cost matrix

Davide Barbieri
You are experiencing a typical problem: your classifier is very sensitive to the majority class, which is better represented. 
To compensate, you can improve performance on the minority class by increasing the number of minority-class instances.
Be sure to apply SMOTE with the necessary rate of oversampling and THEN undersample the majority class.
Go for a 1:1 LONG-to-NORMAL ratio.
 
I did something very similar on a medical dataset, where positive instances were only 8%, and I could achieve an AUC of 0.78.
You have 0.73 here, which is not bad, but sensitivity on the LONG class is very low (0.25).
Again, remember that if you want 100% sensitivity you can simply classify everything as LONG.
So, before trying, decide what the best trade-off is for you. 
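A back-of-the-envelope sketch of this oversample-then-undersample recipe (my own illustration; class counts are taken from the binary confusion matrix posted earlier, and I assume the WEKA SMOTE filter's "percentage" parameter, i.e. the amount of synthetic minority instances to add):

```python
# Reaching a 1:1 LONG-to-NORMAL ratio: SMOTE the minority class up,
# then undersample the majority class down to match.
minority = 11700    # LONG   (10537 + 1163)
majority = 97841    # NORMAL (95839 + 2002)

smote_percentage = 300                              # add 300% synthetic LONG instances
minority_after = minority * (1 + smote_percentage / 100)
majority_after = minority_after                     # undersample NORMAL to match

print(int(minority_after), int(majority_after))     # 46800 46800
```

Note that undersampling NORMAL from 97,841 down to 46,800 discards roughly half the majority class; whether that loss is acceptable depends on how informative those instances are.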

Is the problem of performance assessment in the classification of imbalanced datasets clear?
If not, you may take a look at the following papers:

Fawcett T, An introduction to ROC analysis, Pattern Recognition Letters, 27:861-874, 2006.
Provost F and Fawcett T, Analysis and visualization of classifier performance: Comparison under imprecise class and cost distributions, in Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD), Huntington Beach, CA, USA, 1997.

2017-06-12 15:27 GMT+02:00 Marina Santini <[hidden email]>:
Hi all, 

I am pondering myself on how i can frame the classification problem properly. As I said, it is a real world problem and I cannot disclose the details. 

I will try to describe the problem using an example about "hospitalization".
Say that: 
The purpose of the classification is to predict people prone to long hospitalization. The focus is on the LONG HOSPITALIZATION (the minority class)
Hospitals want to know that in advance about who is prone to long hospitalization because they want to be ready with number of beds, nurses,  and so on. All this costs money and they want to be able to have an idea of the amount of money they have to invest in the future maybe may be in alternative solutions such as eCare at home and similar (well... hope this makes sense... :-) )

A dataset containing records about people who have been hospitalized  in the previous years is available. The number of days of hospitalization has been recorded.

10 nominal (quick weak) predictors are available, like gender, age, employment etc. 

The boundary between a normal hospitalization and a long hospitalization is 3 months. 

So first I tried a binary classification: <=90days vs >90 days. 

The distribution is the following 

Inline images 1


On this dataset, DT performs random, RandomForest poor, and NB-k produces the (best) following results:

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances       97002               88.5531 %
Incorrectly Classified Instances     12539               11.4469 %
Kappa statistic                          0.1163
Mean absolute error                      0.1737
Root mean squared error                  0.2995
Relative absolute error                 91.0125 %
Root relative squared error             96.9781 %
Total Number of Instances           109541     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0,980    0,901    0,901      0,980    0,939      0,146    0,732     0,950     NORMAL
                 0,099    0,020    0,367      0,099    0,156      0,146    0,732     0,260     LONG
Weighted Avg.    0,886    0,807    0,844      0,886    0,855      0,146    0,732     0,876     

=== Confusion Matrix ===

     a     b   <-- classified as
 95839  2002 |     a = NORMAL
 10537  1163 |     b = LONG


The LONG class has low precision and very poor recall.  

I tried to binarize the predictor with marginal improvement of the minority class: 

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0,921    0,714    0,915      0,921    0,918      0,212    0,729     0,950     NORMAL
                 0,286    0,079    0,302      0,286    0,294      0,212    0,729     0,248     LONG
Weighted Avg.    0,853    0,646    0,850      0,853    0,851      0,212    0,729     0,875     

=== Confusion Matrix ===

     a     b   <-- classified as
 90096  7745 |     a = NORMAL
  8355  3345 |     b = LONG

Then I tried all your suggestions and many more. For instance SMOTE gives the following:


                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0,939    0,751    0,913      0,939    0,926      0,214    0,732     0,950     NORMAL
                 0,249    0,061    0,330      0,249    0,284      0,214    0,732     0,256     LONG
Weighted Avg.    0,866    0,677    0,850      0,866    0,857      0,214    0,732     0,876     

=== Confusion Matrix ===

     a     b   <-- classified as
 91904  5937 |     a = NORMAL
  8782  2918 |     b = LONG


Then, feeling frustrated, I thought that I could create an intermediary class to collect all the LONG hospitalizations that tend to be misclassified in the binary classification. 
The assumption is that it might be more useful to know clearly if somebody is going to be a short hospitalized or a long  hospitalized, leaving the MEDIUM class as a grey area where the healthcare staff must possibly carry out more investigations. 

So I came up with the 3 classes of my previous email and this explains (I hope) why I would like to decrease the ac cell (2834) and the aa  cell  (330) of the confusion matrix I reported in my previous mail : hospitals allocate different money for short and long hospitalizations, so the would like accurate predictions on these classes or at least accurate predictions on the LONG hospitalizations. 

hope this fictitious example make sense to you

Cheers, Marina




On 12 June 2017 at 12:42, Davide Barbieri <[hidden email]> wrote:
Hello Marina

sorry, but the problem is not so clear to me.
When classes are imbalanced, it is always difficult to understand which is the best parameter to assess classification performances, even in a basic binary problem.
For example it would be very easy to have a perfect TPR=100%, if you declare every instance to be positive (eg. minority class).
It is a so-called "cry wolf situation". But you would have TNR=0, which is usually not acceptable. Also TNR=100% and TPR=0 is easy: all instances are negative.
It is not obvious what is the best trade-off, but you will always reduce sensitivity to increase specificity and vice-versa.
One way to support the decision for the best trade off is to represent it in ROC space, and then use AUC to assess it, or ROC convex hull (ROCCH). Similarly, you can use the Youden index J.
These parameters are acceptable when equal importance is given to both sensitivity and specificity.
In some cases, eg the medical domain, this is not acceptable.
What is the trade off you are looking for?

 ps I hope I have been able to highlight the problem. If not, sorry for long post

2017-06-12 12:27 GMT+02:00 Marina Santini <[hidden email]>:
Thanks everybody for insightful answers. 

I am trying out many different solutions because the aim is to use a ml model for a real world situation. 
I tried binary classification, but my impression is that it can be even harder. 

With 3 classes, I get this results with simple NB-k. 

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances       57727               52.699  %
Incorrectly Classified Instances     51814               47.301  %
Kappa statistic                          0.1662
Mean absolute error                      0.3679
Root mean squared error                  0.4314
Relative absolute error                 93.7958 %
Root relative squared error             97.4222 %
Total Number of Instances           109541     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0,701    0,452    0,530      0,701    0,603      0,247    0,663     0,555     SHORT
                 0,468    0,365    0,534      0,468    0,499      0,104    0,573     0,532     MEDIUM
                 0,106    0,021    0,378      0,106    0,166      0,154    0,733     0,264     LONG
Weighted Avg.    0,527    0,365    0,516      0,527    0,507      0,170    0,628     0,513     

=== Confusion Matrix ===

     a     b     c   <-- classified as
 32283 13470   330 |     a = SHORT
 25846 24204  1708 |     b = MEDIUM
  2834  7626  1240 |     c = LONG

My aim is to get as ac and aa as possible. Basically I want that the classifiers get the short and long classes as correct as possible. 

In the confusion matrix above, if I could get ac=300, I would start being happy :-)  That's why I was trying to play with weights (ie cost). 

Cheers, Marina


On 12 June 2017 at 12:16, Eibe Frank <[hidden email]> wrote:
No, you are right, you are unlikely to get an improvement overall, at least with the minimum expected cost approach. The default cost matrix, assigning the same cost to every type of error, should be best in terms of overall misclassification error, unless the classifier's probability estimates are inaccurate.

Cheers,
Eibe

> On 12/06/2017, at 8:42 PM, Michael Hall <[hidden email]> wrote:
>
>
>> On Jun 11, 2017, at 8:47 AM, Marina Santini <[hidden email]> wrote:
>>
>>
>> Now I am focusing on cost-sensitive classification,
>
> This is not a normal use of a cost matrix and probably doesn’t improve overall classification performance does it?
> My understanding is that normally a cost matrix is used if there is a different degree of concern for different classification errors?
> Say ‘customers hair turns orange’ is considered very bad to get wrong so you make it more costly.
> When I tried it, it did seem to improve classification for the higher cost entries but didn’t seem to result in improved overall performance.
> The error rate for other outcomes suffered?
> If I remember seeming to result in the classifier getting about the same overall performance as it did without the cost matrix.
> Are there circumstances where this will improve overall?
>
> Mike Hall
>





Re: Unbalanced dataset and multiclass cost matrix

Davide Barbieri
In reply to this post by Marina Santini

2017-06-12 15:27 GMT+02:00 Marina Santini <[hidden email]>:
Hi all, 

I am pondering how I can frame the classification problem properly. As I said, it is a real-world problem and I cannot disclose the details. 

I will try to describe the problem using an example about "hospitalization".
Say that: 
The purpose of the classification is to predict people prone to long hospitalization. The focus is on LONG HOSPITALIZATION (the minority class).
Hospitals want to know in advance who is prone to long hospitalization, because they want to be ready with the right number of beds, nurses, and so on. All this costs money, and they want an idea of how much they will have to invest in the future, maybe in alternative solutions such as eCare at home and similar (well... I hope this makes sense... :-) )

A dataset containing records about people who have been hospitalized  in the previous years is available. The number of days of hospitalization has been recorded.

10 nominal (quite weak) predictors are available, such as gender, age, employment, etc. 

The boundary between a normal hospitalization and a long one is 3 months. 

So first I tried a binary classification: <=90 days vs. >90 days. 

The distribution is the following 

Inline images 1


On this dataset, a decision tree performs at chance level, RandomForest performs poorly, and NB-k produces the following (best) results:

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances       97002               88.5531 %
Incorrectly Classified Instances     12539               11.4469 %
Kappa statistic                          0.1163
Mean absolute error                      0.1737
Root mean squared error                  0.2995
Relative absolute error                 91.0125 %
Root relative squared error             96.9781 %
Total Number of Instances           109541     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0,980    0,901    0,901      0,980    0,939      0,146    0,732     0,950     NORMAL
                 0,099    0,020    0,367      0,099    0,156      0,146    0,732     0,260     LONG
Weighted Avg.    0,886    0,807    0,844      0,886    0,855      0,146    0,732     0,876     

=== Confusion Matrix ===

     a     b   <-- classified as
 95839  2002 |     a = NORMAL
 10537  1163 |     b = LONG


The LONG class has low precision and very poor recall.  

I tried binarizing the predictor, with a marginal improvement on the minority class: 

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0,921    0,714    0,915      0,921    0,918      0,212    0,729     0,950     NORMAL
                 0,286    0,079    0,302      0,286    0,294      0,212    0,729     0,248     LONG
Weighted Avg.    0,853    0,646    0,850      0,853    0,851      0,212    0,729     0,875     

=== Confusion Matrix ===

     a     b   <-- classified as
 90096  7745 |     a = NORMAL
  8355  3345 |     b = LONG

Then I tried all your suggestions and many more. For instance SMOTE gives the following:


                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0,939    0,751    0,913      0,939    0,926      0,214    0,732     0,950     NORMAL
                 0,249    0,061    0,330      0,249    0,284      0,214    0,732     0,256     LONG
Weighted Avg.    0,866    0,677    0,850      0,866    0,857      0,214    0,732     0,876     

=== Confusion Matrix ===

     a     b   <-- classified as
 91904  5937 |     a = NORMAL
  8782  2918 |     b = LONG


Then, feeling frustrated, I thought I could create an intermediate class to collect the LONG hospitalizations that tend to be misclassified in the binary setting. 
The assumption is that it might be more useful to know clearly whether somebody is going to have a short or a long hospitalization, leaving the MEDIUM class as a grey area where the healthcare staff could carry out further investigations. 

So I came up with the 3 classes of my previous email, and this explains (I hope) why I would like to decrease the ca cell (2834) and the ac cell (330) of the confusion matrix I reported in my previous mail: hospitals allocate different budgets for short and long hospitalizations, so they would like accurate predictions on these classes, or at least accurate predictions on the LONG hospitalizations. 

I hope this fictitious example makes sense to you.

Cheers, Marina




On 12 June 2017 at 12:42, Davide Barbieri <[hidden email]> wrote:
Hello Marina

Sorry, but the problem is not so clear to me.
When classes are imbalanced, it is always difficult to decide which parameter best assesses classification performance, even in a basic binary problem.
For example, it would be very easy to get a perfect TPR = 100% by declaring every instance positive (e.g., the minority class).
This is a so-called "cry wolf" situation. But you would then have TNR = 0, which is usually not acceptable. TNR = 100% with TPR = 0 is just as easy: declare all instances negative.
It is not obvious what the best trade-off is, but you will always reduce sensitivity to increase specificity and vice versa.
One way to support the decision about the best trade-off is to represent it in ROC space and then assess it with the AUC, or with the ROC convex hull (ROCCH). Similarly, you can use the Youden index J.
These parameters are acceptable when equal importance is given to sensitivity and specificity.
In some cases, e.g. the medical domain, this is not acceptable.
What is the trade-off you are looking for?

PS: I hope I have been able to highlight the problem. If not, sorry for the long post.
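The Youden index mentioned here is simple to compute from a binary confusion matrix. A quick illustrative sketch (plain Python, treating LONG as the positive class, with the numbers from the binary NB-k run earlier in the thread):

```python
def youden_j(tp, fn, fp, tn):
    """Youden's index J = sensitivity + specificity - 1.
    Ranges from -1 to 1; 0 means chance-level performance."""
    sensitivity = tp / (tp + fn)   # TPR on the positive (minority) class
    specificity = tn / (tn + fp)   # TNR on the negative (majority) class
    return sensitivity + specificity - 1

# Binary NB-k confusion matrix, LONG as positive:
# TP = 1163, FN = 10537, FP = 2002, TN = 95839
print(round(youden_j(1163, 10537, 2002, 95839), 3))  # -> 0.079
```

A J close to 0 despite 88.5% accuracy makes the imbalance problem explicit: almost all of the accuracy comes from the majority class.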


Re: Unbalanced dataset and multiclass cost matrix

Marina Santini
Thanks Davide for the reading suggestions. I will certainly read the papers. 


However, I am not sure I understand what I am supposed to do within Weka when you say: 
"Be sure to apply SMOTE with the necessary rate of oversampling and THEN undersample the majority class.
Go for 1:1 LONG-to-NORMAL ratio. "

I am using a FilteredClassifier with NB-k and SMOTE (see figure below). In order to increase the PRC area, shall I increase the SMOTE percentage to 200% and set the kNN to 3? A good trade-off for my task would be around 0.7 precision and 0.4 recall.



Inline images 1

 Thanks again for your suggestions!

Cheers, Marina


Re: Unbalanced dataset and multiclass cost matrix

Davide Barbieri
Choose FilteredClassifier, then MultiFilter. Apply SMOTE first and then SpreadSubsample:

Inline image 1

 
I am using a filtered NB-k classifier with SMOTE (see figure below). In order to increase the PRC area,  shall I increase the SMOTE percentage to 200% and set the knn to 3?

Keep k = 5 and the percentage at 100%. Set the SpreadSubsample distribution spread to 1.
If needed, increase the oversampling to improve sensitivity to the minority class.
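For intuition, SMOTE's core idea (oversampling the minority class by interpolating between a minority instance and one of its k nearest neighbours) can be sketched in a few lines. This is an illustrative toy for numeric features only; Weka's SMOTE filter additionally handles nominal attributes, which this sketch does not:

```python
import random

def smote_like(minority, n_new, k=5, seed=1):
    """Toy SMOTE-style oversampling: each synthetic point lies on the segment
    between a random minority instance and one of its k nearest neighbours
    (Euclidean distance). Not Weka's implementation, just the idea."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x among the other minority instances
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 2.0)]
extra = smote_like(minority, n_new=5)   # 100% oversampling: 5 new instances
print(len(minority) + len(extra))       # -> 10
```

Because the synthetic points sit between real minority points, the classifier sees a denser (but not simply duplicated) minority region; SpreadSubsample then trims the majority class toward the requested ratio.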
 
A good tradeoff for my task would be to get 0.7precision & 0.4recall.

Precision, or PPV, will always be low if the prevalence of the condition is low: PPV = TP/(TP+FP).
This is the problem with imbalanced datasets. Of course, it depends on the class.
LONG has low prevalence. Even if you have a high TPR (sensitivity close to 1, i.e. 100%), PPV will be low.
This is why we don't use it to assess classifier performance on imbalanced datasets.
ROC represents a trade-off between TNR and TPR.

Given the same HIV test (with the same sensitivity), in a country with HIV prevalence close to 0, its PPV will be close to 0.
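The HIV example can be made concrete with Bayes' rule. A small sketch (illustrative numbers, not from the thread) showing how PPV collapses as prevalence drops, even for an excellent test:

```python
def ppv(sensitivity, specificity, prevalence):
    """PPV = P(condition | positive test), via Bayes' rule:
    sens*prev / (sens*prev + (1-spec)*(1-prev))."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# A very good test (99% sensitivity AND 99% specificity) at three prevalences:
for prev in (0.5, 0.1, 0.001):
    print(f"prevalence={prev}: PPV={ppv(0.99, 0.99, prev):.3f}")
# -> 0.990, 0.917, 0.090: the same test, PPV driven down purely by prevalence.
```

The same arithmetic explains the LONG class here: at roughly 10% prevalence, even a much better classifier would struggle to reach high precision.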

--
Davide Barbieri

http://docente.unife.it/davide.barbieri/

Universita' di Ferrara - http://www.unife.it/


Re: Unbalanced dataset and multiclass cost matrix

Marina Santini
Thanks Davide!

I ran:
weka.classifiers.meta.FilteredClassifier -F "weka.filters.MultiFilter -F \"weka.filters.supervised.instance.SMOTE -C 0 -K 5 -P 100.0 -S 1\" -F \"weka.filters.supervised.instance.SpreadSubsample -M 0.0 -X 0.0 -S 1\"" -W weka.classifiers.bayes.NaiveBayes -- -K

Here the results (distribution 0.0): 
=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances       94822               86.563  %
Incorrectly Classified Instances     14719               13.437  %
Kappa statistic                          0.2113
Mean absolute error                      0.2182
Root mean squared error                  0.3177
Relative absolute error                114.3666 %
Root relative squared error            102.8429 %
Total Number of Instances           109541     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0,939    0,751    0,913      0,939    0,926      0,214    0,732     0,950     NORMAL
                 0,249    0,061    0,330      0,249    0,284      0,214    0,732     0,256     LONG
Weighted Avg.    0,866    0,677    0,850      0,866    0,857      0,214    0,732     0,876     

=== Confusion Matrix ===

     a     b   <-- classified as
 91904  5937 |     a = NORMAL
  8782  2918 |     b = LONG


then I tried (1.1)
Scheme:       weka.classifiers.meta.FilteredClassifier -F "weka.filters.MultiFilter -F \"weka.filters.supervised.instance.SMOTE -C 0 -K 5 -P 100.0 -S 1\" -F \"weka.filters.supervised.instance.SpreadSubsample -M 1.1 -X 0.0 -S 1\"" -W weka.classifiers.bayes.NaiveBayes -- -K

here are the results
=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances       82279               75.1125 %
Incorrectly Classified Instances     27262               24.8875 %
Kappa statistic                          0.2151
Mean absolute error                      0.3729
Root mean squared error                  0.4293
Relative absolute error                195.4239 %
Root relative squared error            139.0055 %
Total Number of Instances           109541     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0,771    0,413    0,940      0,771    0,847      0,249    0,732     0,950     NORMAL
                 0,587    0,229    0,234      0,587    0,335      0,249    0,732     0,256     LONG
Weighted Avg.    0,751    0,394    0,864      0,751    0,792      0,249    0,732     0,876     

=== Confusion Matrix ===

     a     b   <-- classified as
 75415 22426 |     a = NORMAL
  4836  6864 |     b = LONG


Just for your info, in case you are curious about the behaviour of this dataset.

Thanks again

Cheers, Marina



Re: Unbalanced dataset and multiclass cost matrix

Davide Barbieri
Good!
As you can see, your TPR(LONG) more than doubled (from 0.25 to 0.59), while TPR(NORMAL) dropped by 0.94 − 0.77 = 0.17.
So J went from 0.94 + 0.25 − 1 = 0.19 to 0.77 + 0.59 − 1 = 0.36 (it almost doubled).
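J here is Youden's J (TPR + TNR − 1). It can be computed straight from the two confusion matrices in the quoted runs, taking LONG as the positive class (a quick Python check):

```python
def youden_j(tn, fp, fn, tp):
    """Youden's J = sensitivity + specificity - 1."""
    tpr = tp / (tp + fn)   # sensitivity on LONG
    tnr = tn / (tn + fp)   # specificity, i.e. TPR on NORMAL
    return tpr + tnr - 1

# Confusion matrices from the two runs (NORMAL row first, LONG row second):
j_spread_0 = youden_j(tn=91904, fp=5937, fn=8782, tp=2918)    # spread 0.0
j_spread_11 = youden_j(tn=75415, fp=22426, fn=4836, tp=6864)  # spread 1.1
print(round(j_spread_0, 2), round(j_spread_11, 2))  # 0.19 0.36
```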

But you should first consider the following question: do sensitivity and specificity have the same weight or not?
If not, how can you quantify the difference?

2017-06-13 7:30 GMT+02:00 Marina Santini <[hidden email]>:
Thanks Davide!

I ran:
weka.classifiers.meta.FilteredClassifier -F "weka.filters.MultiFilter -F \"weka.filters.supervised.instance.SMOTE -C 0 -K 5 -P 100.0 -S 1\" -F \"weka.filters.supervised.instance.SpreadSubsample -M 0.0 -X 0.0 -S 1\"" -W weka.classifiers.bayes.NaiveBayes -- -K

Here are the results (distribution spread 0.0):
=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances       94822               86.563  %
Incorrectly Classified Instances     14719               13.437  %
Kappa statistic                          0.2113
Mean absolute error                      0.2182
Root mean squared error                  0.3177
Relative absolute error                114.3666 %
Root relative squared error            102.8429 %
Total Number of Instances           109541     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0,939    0,751    0,913      0,939    0,926      0,214    0,732     0,950     NORMAL
                 0,249    0,061    0,330      0,249    0,284      0,214    0,732     0,256     LONG
Weighted Avg.    0,866    0,677    0,850      0,866    0,857      0,214    0,732     0,876     

=== Confusion Matrix ===

     a     b   <-- classified as
 91904  5937 |     a = NORMAL
  8782  2918 |     b = LONG
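The per-class precision and recall that Weka reports for LONG follow directly from this matrix (a quick Python check; rows are the true class, columns the predicted class):

```python
# LONG as the positive class, from the confusion matrix above:
tp, fn = 2918, 8782   # LONG classified as LONG / misclassified as NORMAL
fp = 5937             # NORMAL misclassified as LONG
precision = tp / (tp + fp)  # 2918 / 8855
recall = tp / (tp + fn)     # 2918 / 11700
print(round(precision, 3), round(recall, 3))  # 0.33 0.249
```

This matches the 0,330 precision and 0,249 recall in Weka's detailed accuracy table.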


Then I tried a distribution spread of 1.1:
Scheme:       weka.classifiers.meta.FilteredClassifier -F "weka.filters.MultiFilter -F \"weka.filters.supervised.instance.SMOTE -C 0 -K 5 -P 100.0 -S 1\" -F \"weka.filters.supervised.instance.SpreadSubsample -M 1.1 -X 0.0 -S 1\"" -W weka.classifiers.bayes.NaiveBayes -- -K

Here are the results:
=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances       82279               75.1125 %
Incorrectly Classified Instances     27262               24.8875 %
Kappa statistic                          0.2151
Mean absolute error                      0.3729
Root mean squared error                  0.4293
Relative absolute error                195.4239 %
Root relative squared error            139.0055 %
Total Number of Instances           109541     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0,771    0,413    0,940      0,771    0,847      0,249    0,732     0,950     NORMAL
                 0,587    0,229    0,234      0,587    0,335      0,249    0,732     0,256     LONG
Weighted Avg.    0,751    0,394    0,864      0,751    0,792      0,249    0,732     0,876     

=== Confusion Matrix ===

     a     b   <-- classified as
 75415 22426 |     a = NORMAL
  4836  6864 |     b = LONG
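For what it's worth, the effect of the SpreadSubsample step can be sketched in plain Python. This assumes -M caps the ratio between the largest and smallest class (Weka's distributionSpread semantics); the class sizes are illustrative, using the full NORMAL class and LONG after 100% SMOTE doubles it:

```python
import random

def spread_subsample(instances_by_class, max_spread, seed=1):
    """Rough sketch of SpreadSubsample: undersample the larger classes so the
    largest/smallest class ratio is at most max_spread."""
    rng = random.Random(seed)
    smallest = min(len(rows) for rows in instances_by_class.values())
    cap = int(smallest * max_spread)
    return {label: rng.sample(rows, cap) if len(rows) > cap else list(rows)
            for label, rows in instances_by_class.items()}

data = {"NORMAL": list(range(97841)), "LONG": list(range(23400))}
resampled = spread_subsample(data, max_spread=1.1)
print({label: len(rows) for label, rows in resampled.items()})
```

With max_spread=1.1 the NORMAL class is cut down to 25740 instances, i.e. at most 1.1 times the 23400 LONG instances, which is why the 1.1 run trades NORMAL accuracy for LONG sensitivity.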


Just for your information, in case you are curious about the behaviour of this dataset.

Thanks again

Cheers, Marina







--
Davide Barbieri

http://docente.unife.it/davide.barbieri/

Universita' di Ferrara - http://www.unife.it/
