Train a model using multiple arff files


Sreynoch.Soung
Hello all,

Is there any way to train a single model using multiple ARFF files (all files have the same structure)?
I have multiple generated ARFF files. Each file represents a different context, which is why I keep them in separate files.
I'm not sure whether incremental learning works in this case, since it learns and updates the model instance by instance.

Thank you in advance for your valuable response.

Best regards,

Re: Train a model using multiple arff files

Eibe Frank-2
Administrator
One way to combine your models would be the Vote classifier, which can load base classifiers from files.

Alternatively, you could create a single dataset and then apply Vote in conjunction with multiple FilteredClassifier objects that filter out appropriate subsets of data using the RemoveRange filter.
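If the files are simply concatenated, each file occupies a contiguous block of rows, so the subsets can be described by index ranges. The following is a minimal sketch in plain Python (not WEKA code) that derives 1-based ranges from per-file row counts. The function name and the counts are hypothetical, and how the resulting strings are passed to RemoveRange (including whether the selection needs to be inverted) should be checked against the WEKA documentation.

```python
def row_ranges(row_counts):
    """Return 1-based inclusive 'first-last' ranges for consecutive blocks of rows."""
    ranges = []
    first = 1
    for count in row_counts:
        if count <= 0:
            raise ValueError("each file must contribute at least one row")
        last = first + count - 1
        ranges.append("%d-%d" % (first, last))
        first = last + 1
    return ranges


# Hypothetical example: three ARFF files with 2, 1 and 3 data rows.
print(row_ranges([2, 1, 3]))  # ['1-2', '3-3', '4-6']
```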

A more flexible approach would be to introduce an additional nominal attribute indicating dataset membership in the combined data and then apply FilteredClassifier with RemoveWithValues to filter out appropriate subsets of data. The additional attribute can be removed with the Remove filter before the actual base classifier is applied.
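The combined dataset with a membership attribute could be prepared outside WEKA, for example with a small script. The sketch below is plain Python under several assumptions: all files use the dense (non-sparse) ARFF format, the first header line is the @relation line, the attribute declarations are textually identical, and the file labels contain no spaces or commas. The attribute name "source" and the function names are made up for illustration.

```python
def split_arff(text):
    """Split ARFF text into (header_lines, data_rows); dense format assumed."""
    header, rows, in_data = [], [], False
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            rows.append(stripped)
        elif stripped.lower() == "@data":
            in_data = True
        else:
            header.append(stripped)
    return header, rows


def merge_arff(named_texts, source_attr="source"):
    """Merge ARFF texts with identical attributes, tagging each row with its origin."""
    names = list(named_texts)
    first_header, _ = split_arff(named_texts[names[0]])
    if not first_header or not first_header[0].lower().startswith("@relation"):
        raise ValueError("expected @relation as the first header line")
    attributes = first_header[1:]
    # The membership attribute is added first so the class attribute stays last.
    out = [first_header[0],
           "@attribute %s {%s}" % (source_attr, ",".join(names))]
    out.extend(attributes)
    out.append("@data")
    for name in names:
        header, rows = split_arff(named_texts[name])
        if header[1:] != attributes:
            raise ValueError("attribute declarations differ in " + name)
        out.extend(name + "," + row for row in rows)
    return "\n".join(out) + "\n"


# Hypothetical in-memory example; real use would read the files from disk.
a = "@relation ctx\n@attribute x numeric\n@attribute class {yes,no}\n@data\n1,yes\n2,no\n"
b = "@relation ctx\n@attribute x numeric\n@attribute class {yes,no}\n@data\n3,yes\n"
print(merge_arff({"fileA": a, "fileB": b}))
```

Putting the new attribute first is a design choice of this sketch: it leaves the index of the class attribute relative to the end of the row unchanged.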

Cheers,
Eibe

In reply to the post by Sreynoch.Soung of 21 May 2017, 01:01.

_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
List info and subscription status: https://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: Train a model using multiple arff files

Alexander Osherenko
The same problem is addressed by multimodality fusion: an approach to improving classification when data is extracted from different sources (modalities). Each ARFF file represents a different modality.

I wonder:
1. In two-stage classification: is it possible to store the probabilities of the different outcomes together with the instance numbers, so that a further dataset can use the calculated probabilities as attributes?
2. In Vote, the CombinationRule parameter determines how votes or probabilities are combined, for example by majority vote or by averaging the probabilities. Could CombinationRule be made customizable, so that a user can specify it dynamically, for instance in Python? Maybe this is a question for the python-wrapper forum.

Best, Alexander

In reply to the post by Eibe Frank of 2017-05-21 2:34 GMT+01:00.

Re: Train a model using multiple arff files

Eibe Frank-2
Administrator

> On 21 May 2017, at 19:56, Alexander Osherenko <[hidden email]> wrote:
>
> The same problem is addressed by multimodality fusion: an approach to improving classification when data is extracted from different sources (modalities). Each ARFF file represents a different modality.
>
> I wonder:
> 1. In two-stage classification: is it possible to store the probabilities of the different outcomes together with the instance numbers, so that a further dataset can use the calculated probabilities as attributes?

There is the AddClassification filter, which can optionally append the class probability distribution as extra attributes.

> 2. In Vote, the CombinationRule parameter determines how votes or probabilities are combined, for example by majority vote or by averaging the probabilities. Could CombinationRule be made customizable, so that a user can specify it dynamically, for instance in Python? Maybe this is a question for the python-wrapper forum.

The AddClassification filter can also read a serialized classifier from a file. It may be possible to use two AddClassification filters, in conjunction with the PartitionedMultiFilter, to create a dataset that has class probability distributions from both classifiers. Then these could be combined based on a user-defined expression using AddExpression. Finally, you could perform logistic regression (or linear regression, if the target attribute is numeric) on the resulting attribute to predict the actual class.
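The data flow described above (append the class probability distributions of two classifiers as extra attributes, then derive one more attribute from a user-defined expression) can be sketched in plain Python. This is only an illustration of the idea, not of the WEKA filters themselves: the two "classifiers" are hypothetical stand-in functions that return a distribution over two classes, and all names are made up.

```python
def append_distributions(rows, classifiers):
    """Append each classifier's class-probability distribution to every row."""
    augmented = []
    for row in rows:
        extra = []
        for predict_distribution in classifiers:
            extra.extend(predict_distribution(row))
        augmented.append(list(row) + extra)
    return augmented


def add_expression(rows, expression):
    """Append one derived value per row, computed by a user-defined expression."""
    return [list(row) + [expression(row)] for row in rows]


# Hypothetical stand-ins for two previously trained classifiers.
def classifier_a(row):
    return [row[0], 1.0 - row[0]]


def classifier_b(row):
    return [0.5, 0.5]


rows = [[0.2], [0.9]]
augmented = append_distributions(rows, [classifier_a, classifier_b])
# Columns are now: x, P_a(c0), P_a(c1), P_b(c0), P_b(c1).
# User-defined combination: mean of both estimates for the first class.
combined = add_expression(augmented, lambda r: (r[1] + r[3]) / 2.0)
for row in combined:
    print(row)
```

The last column would then serve as the input attribute for the final logistic or linear regression step.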

However, it might be easier to just use JythonClassifier or GroovyClassifier to write a script for a version of Vote that has an appropriate combination rule:

https://weka.wikispaces.com/Using+WEKA+from+Jython
https://weka.wikispaces.com/Using+WEKA+from+Groovy
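What a pluggable combination rule amounts to can be sketched independently of WEKA. The plain-Python example below treats a rule as a function from a list of per-classifier probability distributions to one combined distribution; it is not the Vote implementation or a JythonClassifier script, and all function names are hypothetical.

```python
from collections import Counter


def average_rule(distributions):
    """Average the probabilities of each class over all base classifiers."""
    n = len(distributions)
    return [sum(column) / n for column in zip(*distributions)]


def majority_rule(distributions):
    """Each base classifier votes for its most probable class (ties: lowest index)."""
    votes = Counter(max(range(len(d)), key=d.__getitem__) for d in distributions)
    n = len(distributions)
    return [votes.get(i, 0) / n for i in range(len(distributions[0]))]


def weighted_rule(weights):
    """Build a custom rule: a weighted average with user-supplied weights."""
    total = float(sum(weights))

    def rule(distributions):
        return [sum(w * p for w, p in zip(weights, column)) / total
                for column in zip(*distributions)]
    return rule


def combine(distributions, rule):
    """Apply a rule and return (predicted class index, combined distribution)."""
    combined = rule(distributions)
    return max(range(len(combined)), key=combined.__getitem__), combined


# Hypothetical distributions from three base classifiers for one instance.
dists = [[0.9, 0.1], [0.4, 0.6], [0.4, 0.6]]
print(combine(dists, average_rule)[0])
print(combine(dists, majority_rule)[0])
print(combine(dists, weighted_rule([0, 1, 1]))[0])
```

In this example the averaged probabilities favour the first class while the majority vote favours the second, which is the kind of difference a custom rule lets the user control.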

The simplest example of a GroovyClassifier is probably the following one, which implements a version of linear regression by subclassing the LinearRegression class:

http://weka.8497.n7.nabble.com/intercept-0-in-linear-regression-tp35792p35803.html

Cheers,
Eibe


Re: Train a model using multiple arff files

Alexander Osherenko
Good to know, thanks. Actually, I implemented these two fusion schemes in Jython a couple of years ago, but the result was horrible since I had to do everything myself.

In reply to the post by Eibe Frank of 2017-05-21 10:30 GMT+01:00.

Re: Train a model using multiple arff files

Sreynoch.Soung
In reply to this post by Eibe Frank-2
Firstly, thank you so much for the solution.

However, I have some concerns.
1. I tried to use the Vote classifier with GridSearch, with Random Forest or any other tree method as the base method, but I got this error:
java.beans.IntrospectionException: Method not found: isC

But it works fine if I change the base method to LibSVM.
Can you tell me what causes this error, please?

2. I tried to use the Vote classifier with MultiSearch, with any tree method or LibSVM as the base method, but I got this error:
java.lang.NoClassDefFoundError: no/uib/cipr/matrix/Matrix

=> I found the same problem with linear regression in this post http://weka.8497.n7.nabble.com/Linear-Regression-Error-td39721.htm
and they solved it by adding some .jar files.
I tried to follow them, but I get other errors:
AVERTISSEMENT: Failed to load implementation from: com.github.fommil.netlib.NativeRefARPACK
java.beans.IntrospectionException: Method not found: isConfidenceFactor


Could you please help me out?
Best regards,
Sreynoch.





Re: Train a model using multiple arff files

Eibe Frank-2
Administrator

> On 28 May 2017, at 06:40, Sreynoch.Soung <[hidden email]> wrote:
>
> 1. I tried to use the Vote classifier with GridSearch, with Random Forest or
> any other tree method as the base method, but I got this error:
> java.beans.IntrospectionException: Method not found: isC
>
> But it works fine if I change the base method to LibSVM.
> Can you tell me what causes this error, please?

GridSearch requires you to specify the names of the two properties of the base classifier that correspond to the two parameters of the base classifier you want to optimise. In RandomForest, you could use "maxDepth" and "numFeatures" as the X and Y properties respectively.

Here is an example configuration (tested on WEKA 3.9.2-SNAPSHOT):

weka.classifiers.meta.GridSearch -E ACC -y-property numFeatures -y-min 1.0 -y-max 10.0 -y-step 1.0 -y-base 10.0 -y-expression I -x-property maxDepth -x-min 1.0 -x-max 10.0 -x-step 1.0 -x-base 10.0 -x-expression I -sample-size 100.0 -traversal COLUMN-WISE -log-file /Users/eibe -num-slots 1 -S 1 -W weka.classifiers.trees.RandomForest -- -P 100 -I 100 -num-slots 1 -K 0 -M 1.0 -V 0.001 -S 1
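The exception text ("Method not found: isC") is consistent with a JavaBeans-style lookup, in which a property name is turned into accessor method names ("is" or "get" plus the capitalized name for reading, "set" plus the capitalized name for writing). The following plain-Python sketch only mimics that naming convention to show why a property name that the base classifier does not have fails; it is not the java.beans implementation, and FakeRandomForest is a hypothetical stand-in class.

```python
class FakeRandomForest:
    """Hypothetical stand-in exposing JavaBeans-style accessors."""

    def getMaxDepth(self):
        return 0

    def setMaxDepth(self, value):
        pass

    def getNumFeatures(self):
        return 0

    def setNumFeatures(self, value):
        pass


def accessor_names(prop):
    """Derive candidate reader names and the writer name from a property name."""
    capitalized = prop[:1].upper() + prop[1:]
    return ["is" + capitalized, "get" + capitalized], "set" + capitalized


def check_property(cls, prop):
    """Raise LookupError if cls lacks a reader or writer for the property."""
    readers, writer = accessor_names(prop)
    if not any(callable(getattr(cls, name, None)) for name in readers):
        raise LookupError("Method not found: " + readers[0])
    if not callable(getattr(cls, writer, None)):
        raise LookupError("Method not found: " + writer)
    return True


for name in ("maxDepth", "numFeatures", "C"):
    try:
        check_property(FakeRandomForest, name)
        print(name, "ok")
    except LookupError as error:
        print(name, "->", error)
```

Under this convention, a property name such as "C" fits a classifier that has a matching accessor pair but fails on one that does not, which matches the behaviour reported above.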

> 2. I tried to use the Vote classifier with MultiSearch, with any tree method
> or LibSVM as the base method, but I got this error:
> java.lang.NoClassDefFoundError: no/uib/cipr/matrix/Matrix
>
> => I found the same problem with linear regression in this post
> http://weka.8497.n7.nabble.com/Linear-Regression-Error-td39721.htm
> and they solved it by adding some .jar files.
> I tried to follow them, but I get other errors:
> AVERTISSEMENT: Failed to load implementation from:
> com.github.fommil.netlib.NativeRefARPACK

You can ignore this warning. It means WEKA could not find a system-optimized BLAS implementation, which will just affect speed of execution.

> java.beans.IntrospectionException: Method not found: isConfidenceFactor

Does the base classifier you use for MultiSearch have a property called "confidenceFactor"? This exception indicates that it doesn't. You need to set the parameter names in MultiSearch appropriately for the base classifier you use (see above answer regarding GridSearch).

Cheers,
Eibe


Re: Train a model using multiple arff files

Sreynoch.Soung
Thank you so much for your quick response and detailed explanation; now I understand the problems.

But now I find that it takes a very long time to learn a model using the Vote classifier with Random Forest as the base method.
I have 35,649,138 examples with 151 attributes each. It has been learning for almost 30 hours, but there is still no result.
I'm wondering approximately how long it is supposed to take.

Could you please tell me whether there is any parallel approach or function for this case?
I searched and found Weka Parallel, but I couldn't get more information about it, and some links are out of date.

Thank you in advance for your answers.

Best regards,
Sreynoch.

Re: Train a model using multiple arff files

Eibe Frank-2
Administrator
Wow, that’s a lot of data! Have you checked how much heap space WEKA is using? You can use the jvisualvm command to inspect a running Java virtual machine. If it gets close to the maximum heap size you have specified, the JVM might start spending a lot of time on garbage collection.

Are you using multiple versions of RandomForest? If not, why do you use Vote? Using Vote with a single RandomForest does not make sense.

The implementation of RandomForest is multi-threaded. You can use the -num-slots parameter to specify the number of threads to use (my own heuristic is: number of actual CPU cores + 1, on a four-core CPU).
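As a small illustration of the "cores + 1" heuristic, the following Python sketch builds the corresponding option pair. The helper names are hypothetical; note that os.cpu_count() reports logical CPUs, which can exceed the number of physical cores on machines with hyper-threading, so passing the physical core count explicitly may be closer to the heuristic as stated.

```python
import os


def suggested_num_slots(cores=None):
    """Return cores + 1; falls back to the logical CPU count if cores is not given."""
    if cores is None:
        cores = os.cpu_count() or 1
    return cores + 1


def random_forest_slot_options(cores=None):
    """Build the -num-slots option pair for a RandomForest command line."""
    return ["-num-slots", str(suggested_num_slots(cores))]


# Hypothetical example for a four-core CPU.
print(" ".join(random_forest_slot_options(4)))  # -num-slots 5
```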

There are also the packages for distributed computing using Hadoop or Spark:

  http://markahall.blogspot.co.nz/2013/10/weka-and-hadoop-part-1.html

RandomForest extends Bagging so it is Aggregateable and can be distributed across multiple computers.

Given m attributes and n instances, a rough bound on time complexity for RandomForest should be O(log(m) n log(n)^2) if you use the default heuristic for setting the number of attributes to evaluate at each node.

I’d try it on smaller subsets to get an idea of how long it might take on your full dataset.
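A timing on a subset can be turned into a rough estimate for the full dataset by scaling with the bound given above. The sketch below assumes that the number of attributes stays fixed (so the log(m) factor cancels), that "n log(n)^2" is read as n * (log n)^2, and that constant factors and memory effects such as garbage collection are ignored. The measured time in the example is hypothetical; only the full dataset size comes from this thread.

```python
import math


def cost(n):
    """Growth term n * (log n)^2 for n instances (attribute count held fixed)."""
    return n * math.log(n) ** 2


def extrapolate_seconds(measured_seconds, n_subset, n_full):
    """Scale a measured training time from n_subset to n_full instances."""
    if n_subset < 2 or n_full < 2:
        raise ValueError("instance counts must be at least 2")
    return measured_seconds * cost(n_full) / cost(n_subset)


# Hypothetical measurement: 600 seconds on a subset of 1,000,000 instances.
estimate = extrapolate_seconds(600.0, 1000000, 35649138)
print("estimated hours: %.1f" % (estimate / 3600.0))
```

Timing two or three subsets of different sizes and checking whether the estimates agree would indicate how far the bound can be trusted on this data.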

I assume you are doing a single train-test split (“percentage split”)? Make sure you turn off output of the model when you run WEKA, otherwise RandomForest will be run twice: once on the training split and once on the full dataset (to build the model to be output)!

Cheers,
Eibe

In reply to the post by Sreynoch.Soung of 29/05/2017, 4:57 AM.