RandomForest predictions

Tom
Hello,

I've created a RandomForest model in the training phase (nominal class, 2 values), then saved it as a classifier object.
Later, using a different dataset, I load this classifier object, and try to run the Classifier.distributionForInstance(Instance instance) method on new instances that I've loaded, one by one.

This fails on my new dataset!

The reason is that the Bagging class demands this:

  @Override
  public double[] distributionForInstance(Instance instance) throws Exception {

    double[] sums = new double[instance.numClasses()], newProbs;

The numClasses() call checks "what is the class attribute" and this crashes because my new dataset, of course, doesn't have a class attribute; that's exactly the point.

Using debugging and hacking to get past that line, I've found that the relevant code a few lines down:

  newProbs = m_Classifiers[i].distributionForInstance(instance);
  for (int j = 0; j < newProbs.length; j++)
    sums[j] += newProbs[j];

actually works fine. The newProbs array contains 2 values with distributions, which is what I want in the end.

How can I overcome this?

Best regards,
  Tom

_______________________________________________
Wekalist mailing list -- [hidden email]
Send posts to [hidden email]
To unsubscribe send an email to [hidden email]
To subscribe, unsubscribe, etc., visit https://list.waikato.ac.nz/postorius/lists/wekalist.list.waikato.ac.nz
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: RandomForest predictions

Peter Reutemann
Training and test set (or in your case data to be predicted) require
the exact same structure.

You might ask why and the reason is quite simple: at prediction time,
the classifier might need to inquire how many class labels there are.
Hence it needs the full structure, including the class attribute.

At prediction time, you simply provide a missing value
(weka.core.Utils.missingValue()) as class value in your Instance
object.

Cheers, Peter

--
Peter Reutemann
Dept. of Computer Science
University of Waikato, NZ
+64 (7) 577-5304
http://www.cms.waikato.ac.nz/~fracpete/
http://www.data-mining.co.nz/

Re: RandomForest predictions

Tom
Hi,

Okay, so to do that, I load my prediction dataset, then add a fake "class" column to the instances somehow, and give them all the value of "missing value"? How then would the classifier know that there are 2 classes in my case, or 3 in the next case?

Best regards,
  Tom


Re: RandomForest predictions

Peter Reutemann
> Okay, so to do that, I load my prediction dataset, then add a fake "class" column to the instances somehow, and give them all the value of "missing value"? How then would the classifier know that there are 2 classes in my case, or 3 in the next case?

Every Instance object has a reference to an Instances object, which
contains the Attribute definitions (aka structure of the dataset).
That's how the classifier can access the class attribute.
When saving a classifier Weka also saves the dataset structure in the
serialized file (1st element: model, 2nd element: dataset
header/structure).
You can just load the dataset structure and then move the data from
your prediction dataset into the correct structure (and use a missing
value for the class).
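
The "1st element: model, 2nd element: header" layout can be pictured with plain JDK object streams. This is a minimal sketch, not Weka code: the class name `TwoObjectStream`, the String stand-in for the model, and the String[] stand-in for the header are all hypothetical, and the helper here takes an explicit object count. That Weka's own `SerializationHelper.writeAll`/`readAll` lay the file out the same way is an assumption drawn from the description above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

final class TwoObjectStream {

    /** Writes the given objects back to back into one serialization stream. */
    static byte[] writeAll(Object... objects) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            for (Object o : objects) {
                out.writeObject(o);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads exactly count objects back, in the order they were written. */
    static Object[] readAll(byte[] data, int count) {
        Object[] result = new Object[count];
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            for (int i = 0; i < count; i++) {
                result[i] = in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins: a String for the model, a String[] for the header's class labels.
        byte[] file = writeAll("model", new String[] {"yes", "no"});
        Object[] parts = readAll(file, 2);
        String model = (String) parts[0];      // 1st element: the model
        String[] labels = (String[]) parts[1]; // 2nd element: the header
        System.out.println(model + " with " + labels.length + " class labels");
    }
}
```

The point of the layout is that the second element travels with the model, so the class labels are available at prediction time without the training data. A model saved on its own, as a single object, would not carry that second element.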

The Weka manual and the wiki contain information on how to use the Weka API:
https://waikato.github.io/weka-wiki/using_the_api/

Cheers, Peter

Re: RandomForest predictions

Tom
Hi,

Cool, so I would just need to find out how to get to the saved dataset structure.

I've read and already implemented https://waikato.github.io/weka-wiki/serialization/ but it doesn't actually help me with the dataset structure. It just tells me to store:

  RandomForest rf = new RandomForest();
  rf.buildClassifier(train);
  weka.core.SerializationHelper.write(path, rf);

But then loading that with weka.core.SerializationHelper.read(path) (and casting, of course) doesn't actually help me. I just get the classifier back, and when I call distributionForInstance on it I get the earlier error.

Or do you mean this little section:

  "In order to read serialized models that contain the header information as well, you can use the readAll method of the weka.core.SerializationHelper. For serializing models with their datasets, use writeAll."

But the manual doesn't tell me how to actually use the header information to reconstruct an empty Instances object into which I can then add my loaded dataset. I suppose it's one by one after some initialization?

Sorry for these newbie questions, but I'm just having a hard time getting the details to work, and prediction is the last (but necessary) step in the process.

Best regards,
  Tom


Re: RandomForest predictions

Tom
Hi,

New day, I've read some more, but I still haven't found the relevant documentation:

https://github.com/Waikato/weka-wiki/blob/master/docs/making_predictions.md doesn't tell me much about how to make my dataset compatible with a saved model.
https://github.com/Waikato/weka-wiki/blob/master/docs/use_weka_in_your_java_code.md#classifying-instances is similar: it assumes the unlabeled dataset already has the class column, but doesn't specify how to tell it which classes exist.

Best regards,
  Tom


Re: RandomForest predictions

Peter Reutemann
In reply to this post by Tom
> Cool, so I would just need to find out how to get to the saved dataset structure.
>
> I've read and already implemented https://waikato.github.io/weka-wiki/serialization/ but it doesn't actually help me with the dataset structure? It just tells me to store:
>
> RandomForest rf = new RandomForest();
> rf.buildClassifier(train);
>
> weka.core.SerializationHelper.write(path, rf);
>
> But then loading that with weka.core.SerializationHelper.read(path) (and casting of course) doesn't actually help me? I just get the classifier back, and when I ask it to distributionForInstance I get the earlier error.
>
> Or do you mean this little section:
>
> In order to read serialized models that contain the header information as well, you can use the readAll method of the weka.core.SerializationHelper. For serializing models with their datasets, use writeAll.

See the Weka manual PDF that comes with your Weka distribution (part IV,
Appendix, section "18.2 Serialization"). It has examples for
readAll/writeAll applied to classifiers.

> But the manual doesn't tell me how to actually use the header information to reconstruct an empty Instances object, where I can then add my loaded dataset into. I suppose it's one by one after some initialization?

The header information *is* an empty weka.core.Instances object.
Use the "Instances(Instances dataset, int capacity)" constructor to
create a copy of it:
https://weka.sourceforge.io/doc.dev/weka/core/Instances.html#Instances-weka.core.Instances-int-

> Sorry for these newbie questions, but I'm just having a hard time getting the details to work, and prediction is the last (but necessary) step in the process.

For creating rows (aka Instance objects), have a look through this example:
https://waikato.github.io/weka-wiki/formats_and_processing/creating_arff_file/

You basically create a new double array, fill it with values (or
Utils.missingValue()) and then instantiate a DenseInstance (or
SparseInstance) object with it.
https://weka.sourceforge.io/doc.dev/weka/core/DenseInstance.html
https://weka.sourceforge.io/doc.dev/weka/core/SparseInstance.html
And then don't forget to reference the header information by setting
the Instances reference.
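
The steps above (value array, missing class value, reference to the header) can be pictured with a small model in plain Java. This is a sketch under stated assumptions, not Weka code: `Header`, `Row`, and `toPredictionRow` are hypothetical stand-ins for the header Instances object, an Instance, and the conversion step, and `Double.NaN` stands in for the missing value. It also assumes the features arrive in the same order as the header's non-class attributes.

```java
import java.util.Arrays;
import java.util.List;

final class HeaderBoundRow {

    /** Stand-in for the dataset structure: attribute names, class position, class labels. */
    static final class Header {
        final List<String> attributes;
        final int classIndex;
        final List<String> classLabels;

        Header(List<String> attributes, int classIndex, List<String> classLabels) {
            this.attributes = attributes;
            this.classIndex = classIndex;
            this.classLabels = classLabels;
        }

        int numClasses() {
            return classLabels.size();
        }
    }

    /** Stand-in for one row; it answers structural questions only through its header. */
    static final class Row {
        final double[] values;
        Header dataset; // stays null until the row is bound to a structure

        Row(double[] values) {
            this.values = values;
        }

        int numClasses() {
            if (dataset == null) {
                throw new IllegalStateException("row has no dataset reference");
            }
            return dataset.numClasses();
        }
    }

    /** Copies the features into the header's layout and leaves the class slot missing. */
    static Row toPredictionRow(Header header, double[] features) {
        double[] values = new double[header.attributes.size()];
        int next = 0;
        for (int i = 0; i < values.length; i++) {
            // NaN marks the unknown class value; every other slot takes the next feature.
            values[i] = (i == header.classIndex) ? Double.NaN : features[next++];
        }
        Row row = new Row(values);
        row.dataset = header; // bind the row to the structure
        return row;
    }

    public static void main(String[] args) {
        Header header = new Header(Arrays.asList("width", "height", "class"), 2,
                Arrays.asList("yes", "no"));
        Row row = toPredictionRow(header, new double[] {1.5, 0.7});
        System.out.println(row.numClasses() + " classes, class value missing: "
                + Double.isNaN(row.values[header.classIndex]));
    }
}
```

In Weka itself the corresponding pieces would be `weka.core.Utils.missingValue()` for the class slot, `new DenseInstance(1.0, values)` for the row, and `setDataset(header)` for the binding. The number of classes then comes from the header, which answers the earlier question of how the classifier knows whether there are 2 or 3 classes.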

BTW Javadoc comes with your Weka distribution as well.

Cheers, Peter