Bug in predictions output on Explorer Classifier Output Panel?

Bug in predictions output on Explorer Classifier Output Panel?

SanjayPethe
Hello Again,
I am completely baffled by this output I am seeing. Can you explain the discrepancy, or have I uncovered a bug?

Running an InputMapped Classifier with J48 as the classifier in Weka 3.8. Have a separate Training Set of 4046 instances and a Test Set of 9016 instances.

I selected the option to output predictions as CSV when generating the model. The model statistics said 76.5% of the instances (6672) were correctly classified. The actual vs. predicted output in the Explorer Classifier Output panel, however, was not even close. The first 30 instances are shown below. In all, only 24 of the 9016 instances did NOT have a + in the error column, meaning that predicted and actual matched for only 24 of the 9016.

I then went to the visualize classifier errors panel and saved the output as an arff file and compared the predicted and actual from this arff file. This matched the statistics reported by the model. The same first 30 lines from the arff file are also shown below.

Seems to me that the output on the Classifier Output panel is messed up and not reporting correctly. Or am I missing something?

Regards,
Sanjay Pethe

Classifier Output Panel
inst#,actual,predicted,error,prediction
1,1:1263,16:1538,+,1
2,2:1611,9:1472,+,1
3,3:1267,36:1543,+,1
4,4:1445,13:1615,+,1
5,5:1564,1:1263,+,0.985
6,6:1488,27:1424,+,0.957
7,5:1564,11:1497,+,0.999
8,7:1489,21:1548,+,1
9,8:1533,49:1420,+,1
10,9:1472,1:1263,+,0.952
11,10:1606,12:1443,+,1
12,11:1497,20:1492,+,1
13,11:1497,20:1492,+,1
14,2:1611,9:1472,+,1
15,3:1267,15:1555,+,1
16,5:1564,11:1497,+,0.999
17,5:1564,11:1497,+,0.999
18,3:1267,15:1555,+,1
19,6:1488,1:1263,+,1
20,12:1443,29:1265,+,0.976
21,13:1615,11:1497,+,1
22,4:1445,13:1615,+,1
23,2:1611,8:1533,+,0.991
24,2:1611,8:1533,+,0.991
25,11:1497,20:1492,+,1
26,1:1263,16:1538,+,1
27,3:1267,15:1555,+,1
28,6:1488,1:1263,+,0.985
29,1:1263,16:1538,+,1
30,14:1469,8:1533,+,0.991
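
The panel output above can be tallied with a few lines of Python rather than by eye. This is only a sketch: it assumes the CSV text has been copied out of the panel with the header row intact, that a `+` in the `error` column marks a misclassification, and that the `actual` and `predicted` fields have the form `index:label`. The function name `summarize` and the five-row sample are illustrative, not part of Weka.

```python
import csv

# First five rows of the Classifier Output panel, copied from the post.
SAMPLE = """inst#,actual,predicted,error,prediction
1,1:1263,16:1538,+,1
2,2:1611,9:1472,+,1
3,3:1267,36:1543,+,1
4,4:1445,13:1615,+,1
5,5:1564,1:1263,+,0.985
"""


def label_of(field):
    """Return the label part of an 'index:label' field."""
    return field.split(":", 1)[1]


def summarize(text):
    """Return (rows, rows flagged '+', rows whose printed labels differ)."""
    rows = list(csv.DictReader(text.splitlines()))
    flagged = sum(1 for r in rows if r["error"].strip() == "+")
    differing = sum(
        1 for r in rows if label_of(r["actual"]) != label_of(r["predicted"])
    )
    return len(rows), flagged, differing


if __name__ == "__main__":
    total, flagged, differing = summarize(SAMPLE)
    print(f"{total} rows, {flagged} flagged '+', {differing} with differing labels")
```

Running the same function over the full 9016-row output would give the count of unflagged rows that the post reports as 24.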

Arff File from Visualize Classifier Errors
WekaID ID 'prediction margin' 'predicted ELEMENT_3' ELEMENT_3
1 0 1 1263 1263
2 0 1 1611 1611
3 0 -1 1265 1267
4 0 1 1445 1445
5 0 -0.985294 1488 1564
6 0 -0.913043 1540 1488
7 0 0.997704 1564 1564
8 0 1 1489 1489
9 0 -1 1341 1533
10 0 -0.952381 1488 1472
11 0 1 1606 1606
12 0 1 1497 1497
13 0 1 1497 1497
14 0 1 1611 1611
15 0 1 1267 1267
16 0 0.997704 1564 1564
17 0 0.997704 1564 1564
18 0 1 1267 1267
19 0 1 1488 1488
20 0 0.952381 1443 1443
21 0 -1 1564 1615
22 0 1 1445 1445
23 0 -0.990859 1469 1611
24 0 -0.990859 1469 1611
25 0 1 1497 1497
26 0 1 1263 1263
27 0 1 1267 1267
28 0 0.970588 1488 1488
29 0 1 1263 1263
30 0 0.987203 1469 1469
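
One way to probe the discrepancy is to line up the two listings row by row. The sketch below uses only the 30 rows shown in this post and checks whether each index in the panel's `predicted` field always corresponds to the same predicted label in the ARFF file. If it does, that would suggest (as a hypothesis, not a confirmed diagnosis) that the panel prints the right class index but looks up its label in a differently ordered label list. The variable and function names are illustrative.

```python
from collections import defaultdict

# 'predicted' field of Classifier Output rows 1-30, as "index:label".
CSV_PREDICTED = """
16:1538 9:1472 36:1543 13:1615 1:1263 27:1424 11:1497 21:1548 49:1420 1:1263
12:1443 20:1492 20:1492 9:1472 15:1555 11:1497 11:1497 15:1555 1:1263 29:1265
11:1497 13:1615 8:1533 8:1533 20:1492 16:1538 15:1555 1:1263 16:1538 8:1533
""".split()

# 'predicted ELEMENT_3' column of the ARFF rows 1-30.
ARFF_PREDICTED = """
1263 1611 1265 1445 1488 1540 1564 1489 1341 1488
1606 1497 1497 1611 1267 1564 1564 1267 1488 1443
1564 1445 1469 1469 1497 1263 1267 1488 1263 1469
""".split()

# ELEMENT_3 (actual class) column of the ARFF rows 1-30.
ARFF_ACTUAL = """
1263 1611 1267 1445 1564 1488 1564 1489 1533 1472
1606 1497 1497 1611 1267 1564 1564 1267 1488 1443
1615 1445 1611 1611 1497 1263 1267 1488 1263 1469
""".split()


def index_to_labels(csv_predicted, arff_predicted):
    """Map each panel class index to the set of ARFF labels seen with it."""
    mapping = defaultdict(set)
    for field, label in zip(csv_predicted, arff_predicted):
        index = int(field.split(":", 1)[0])
        mapping[index].add(label)
    return dict(mapping)


mapping = index_to_labels(CSV_PREDICTED, ARFF_PREDICTED)
# True when every panel index pairs with exactly one ARFF label.
consistent = all(len(labels) == 1 for labels in mapping.values())
# Number of ARFF rows where predicted and actual agree.
arff_matches = sum(p == a for p, a in zip(ARFF_PREDICTED, ARFF_ACTUAL))

if __name__ == "__main__":
    print("distinct panel indices:", len(mapping))
    print("index -> label is consistent:", consistent)
    print("ARFF matches:", arff_matches, "of", len(ARFF_ACTUAL))
```

On these 30 rows the 14 distinct panel indices each pair with exactly one ARFF label, and the ARFF listing has 22 of 30 rows matching, which is in the same range as the reported model accuracy. Thirty rows cannot prove the cause, but the pattern is what a label lookup against the wrong header would produce.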

Regards,
Sanjay Pethe

-----Original Message-----
From: Pethe, Sanjay
Sent: Thursday, February 16, 2017 8:22 AM
To: Weka machine learning workbench list. <[hidden email]>
Subject: RE: [EXTERNAL]Re: [Wekalist] Question on getting predictions from saved models from Weka

Mark,
Thank you for the prompt response - had not noticed it because the response had been sent to spam and it just occurred to me to look there.

I have not tried command line operations in the past, may give that a shot. I have tried something similar to what you suggest for the KF, but not using a Filtered Classifier. I have done the filtering separately upfront and then used an InputMapped Classifier, and this did not work. I'll try what you recommend with the FilteredClassifier and let you know if that worked.

Regards,
Sanjay Pethe

-----Original Message-----
From: [hidden email] [mailto:[hidden email]] On Behalf Of Mark Hall
Sent: Wednesday, February 15, 2017 2:06 AM
To: Weka machine learning workbench list. <[hidden email]>
Subject: [EXTERNAL]Re: [Wekalist] Question on getting predictions from saved models from Weka

I'm not too sure why this is proving difficult for you. Here are two examples:

1. Command line. Assuming I have a serialized model called test.model trained on the iris data

java weka.Run .AddClassification -serialized test.model -i ~/datasets/UCI/iris.arff -c last

Add the -distribution flag if you want to see probability distributions instead of predicted labels.

2. In the KnowledgeFlow

ArffLoader -dataset-> ClassAssigner -dataset-> TestSetMaker -testSet-> <classifier step that matches the saved classifier, configured with the path to test.model in "Classifier model to load" under "Additional options"> -batchClassifier-> PredictionAppender -testSet-> ArffSaver/CSVSaver/TextViewer etc.

The PredictionAppender can be configured to output probability distributions instead of labels too.

If you are using lots of filters for preprocessing, then you need to wrap these up with your chosen base classifier in a FilteredClassifier when building and saving your model. More than one filter can be conveniently specified by using a MultiFilter.
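
For batch use, the command-line route in example 1 can be driven from a script. The following is a rough sketch that only assembles the flags shown above (`-serialized`, `-i`, `-c`, `-distribution`). It assumes `java` is on the PATH with weka.jar on the CLASSPATH, and that the filter writes its result to standard output when no output file is given. The helper names `build_command` and `run_to_file` are hypothetical.

```python
import os
import subprocess


def build_command(model_path, data_path, class_index="last", distribution=False):
    """Assemble the AddClassification command line as an argument list."""
    cmd = [
        "java", "weka.Run", ".AddClassification",
        "-serialized", os.path.expanduser(model_path),
        "-i", os.path.expanduser(data_path),
        "-c", class_index,
    ]
    if distribution:
        # Ask for class probability distributions instead of labels.
        cmd.append("-distribution")
    return cmd


def run_to_file(cmd, out_path):
    """Run the command and redirect its standard output to out_path.

    Assumption: the filter prints the resulting ARFF to stdout.
    """
    with open(out_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)


if __name__ == "__main__":
    # Print the command only; call run_to_file(...) to actually execute it.
    print(" ".join(build_command("test.model", "iris.arff", distribution=True)))
```

Building the argument list separately from running it makes it easy to inspect the exact command before pointing it at a large dataset.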

Cheers,
Mark.

On 15/02/17, 5:23 PM, "Pethe, Sanjay" <[hidden email] on behalf of [hidden email]> wrote:

    Hello,
    First of all, Thank You to all Weka developers for having created this product. It has helped me tremendously come up the steep learning curve in this area. I am working on a text classification problem, and have models running in both the Explorer and
    Knowledge Flow with an InputMapped Classifier using J48. I intend to experiment with classifiers soon.
     
    One area that I have found difficult in Weka is making predictions from saved models. I have done this in Explorer, but have not had much success doing so in Knowledge Flow. The process for doing so, however, is arcane and really non-intuitive. (I am referring to the methods mentioned at https://weka.wikispaces.com/Saving+and+loading+models and https://weka.wikispaces.com/Making+predictions.) This is such a basic step that I am wondering if I am missing some simpler way to do this. I tried the AddClassification filter mentioned in the second article and it did not work for me.
     
    I would really like to know if there is a way to get this working in Knowledge Flow in particular, because that would allow me to perform other actions, such as removing attributes, as part of a single process. My initial trial set had about 30K instances that end up with about 8K-10K attributes after the StringToWordVector conversion. The dataset I eventually want to process (for prediction only) has about 4 million instances, and I expect there to be about 10K attributes after the StringToWordVector conversion.
    The output I want is a CSV or text file with the ID, predicted class and maybe some confidence measure, but certainly not all the 8K-10K attributes representing the words. Unfortunately, the only way I know to get this is to use the Explorer to load a saved model, make the predictions, look at them in the visualizer and save all attributes to an ARFF file. Then I open the ARFF file, remove the unwanted attributes and save as CSV. These ARFF files will be very large because of the thousands of extra word attributes.
     
    I could use the output attributes option and copy-paste from the classifier output, if I could find a way to include the instance ID in the output. However, is there a way to save this file directly instead of copy-pasting? I don't know whether there will be any size limit issues with copy-paste at this volume of data.
     
    Is there a better way to accomplish this than what I have outlined here? I am using Weka 3.8 on Windows 7.
     
    Regards,
    Sanjay Pethe
   
     
   
   
    _______________________________________________
    Wekalist mailing list
    Send posts to: [hidden email]
    List info and subscription status: https://list.waikato.ac.nz/mailman/listinfo/wekalist
    List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html
   



Re: Bug in predictions output on Explorer Classifier Output Panel?

SanjayPethe
Further information on this ... this does seem to be a bug in the csv output process.

I was testing Mark's suggestions on using KF to make predictions and output them (discussed in a separate post; it seems to be working great, by the way, thank you). I output the predictions in CSV format using the CSVSaver from the KF for almost the same data (the only difference was that I treated both the training set and test set as a test set in the prediction run, instead of just using the test set) and got exactly the same result: actual and predicted matched for only 24 instances. I then added an ArffSaver to the flow, read the ARFF file into the Explorer, saved it as CSV, and compared the results. This matched the 96% accuracy reported by the performance evaluator.

Points to a problem in the actual output of csv data from the classifier, since the conversion from arff to csv seems to be working fine.
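
The ARFF-side check described above can also be done without a round trip through the Explorer. This is a minimal sketch that assumes a dense (non-sparse) ARFF file with comma-separated data rows and attribute names like those in the listing from the first post. The header in `SAMPLE_ARFF` is invented for illustration, and `arff_accuracy` is a hypothetical helper, not a Weka function.

```python
import csv
import shlex

# Illustrative dense ARFF; the header layout is an assumption.
SAMPLE_ARFF = """@relation predictions
@attribute ID numeric
@attribute 'prediction margin' numeric
@attribute 'predicted ELEMENT_3' {1263,1611,1265,1267}
@attribute ELEMENT_3 {1263,1611,1265,1267}
@data
0,1,1263,1263
0,1,1611,1611
0,-1,1265,1267
"""


def arff_accuracy(text, predicted_attr, actual_attr):
    """Return (matches, total) comparing two columns of a dense ARFF string."""
    names, rows, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        if in_data:
            rows.append(next(csv.reader([line], quotechar="'",
                                        skipinitialspace=True)))
        elif line.lower().startswith("@attribute"):
            # shlex handles quoted attribute names containing spaces
            names.append(shlex.split(line)[1])
        elif line.lower().startswith("@data"):
            in_data = True
    p, a = names.index(predicted_attr), names.index(actual_attr)
    matches = sum(1 for r in rows if r[p] == r[a])
    return matches, len(rows)


if __name__ == "__main__":
    matches, total = arff_accuracy(SAMPLE_ARFF, "predicted ELEMENT_3", "ELEMENT_3")
    print(f"{matches} of {total} predictions match")
```

Comparing this count with the evaluator's reported accuracy gives an independent check on which of the two exports is wrong.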

Regards,
Sanjay Pethe

-----Original Message-----
From: Pethe, Sanjay
Sent: Thursday, February 16, 2017 10:54 PM
To: '[hidden email]' <[hidden email]>
Subject: Bug in predictions output on Explorer Classifier Output Panel?
