erratic behaviour hindering experimental replicability


Marina Santini
Hi, 

My co-worker and I are running the same sets of Weka experiments, on the same datasets, with the same algorithms and the same filters, but on different laptops.
We do this to ensure and monitor the experimental replicability and reproducibility of our results, an important aspect of our field.

To make a long story short, we are experiencing erratic behaviour. To give just one example (I can provide more if needed): when we open the functions classifiers (under the Classify tab) on the same dataset, on my laptop all the functions classifiers are available (see Fig. 1), while on my co-worker's laptop only SMO is available and all the other classifiers are greyed out (Fig. 2).

Fig.1.
image.png


Fig. 2. 

image.png


We are both using Windows 10 and the same Weka version. What kind of troubleshooting should we go through?

Any suggestion or enlightenment would be appreciated.

Thanks in advance.

Cheers, Marina

_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
To subscribe, unsubscribe, etc., visit https://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: erratic behaviour hindering experimental replicability

Eibe Frank-2
Administrator
What’s greyed out depends on the properties of the dataset that has been loaded and on the attribute that has been chosen as the class attribute.

Assuming you have both loaded exactly the same data, and chosen the same class attribute, the lists should be the same.

If that’s not the case, try renaming the folder named “wekafiles” so that all packages are temporarily invisible to WEKA. It is possible that a package plays havoc with the GUI. The usual culprit is the Auto-WEKA package.
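
If it helps, here is a minimal sketch of that renaming step. The only detail taken from the suggestion above is the folder name "wekafiles" (it normally lives in the user's home directory); the throwaway directory below is purely for illustration.

```shell
# Temporarily hide all installed WEKA packages by renaming the package
# folder. WEKA recreates an empty "wekafiles" on its next start, and
# renaming the folder back restores the packages, so the step is reversible.
# Demonstrated here on a throwaway stand-in for the home directory.
home_dir=$(mktemp -d)
mkdir -p "$home_dir/wekafiles/packages"   # stand-in for installed packages
mv "$home_dir/wekafiles" "$home_dir/wekafiles.disabled"
ls "$home_dir"                            # only wekafiles.disabled remains
```

On Windows (with WEKA closed), the equivalent would be `ren %USERPROFILE%\wekafiles wekafiles.disabled` in a command prompt.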

Cheers,
Eibe

> On 5/02/2019, at 11:39 PM, Marina Santini <[hidden email]> wrote:
> [...]

Re: erratic behaviour hindering experimental replicability

Marina Santini
Hi Eibe, 

Thanks for your reply. As a matter of fact, I have Auto-WEKA installed, while my co-worker probably does not. We will look into this.

Thanks again. 

Cheers, Marina

On Tue, 5 Feb 2019 at 23:59, Eibe Frank <[hidden email]> wrote:
[...]

Re: erratic behaviour hindering experimental replicability

Marina Santini
Thanks for your detailed answer, Eibe. 

I will run the classifiers with the parameter settings you suggest. I will let you know as soon as I have some results. 

Have a nice start of the week.

Cheers, Marina

On Sun, 10 Feb 2019 at 00:59, Eibe Frank <[hidden email]> wrote:

Hi Marina,

Thanks for sending me the data. Very interesting. It turns out that the reason it takes “forever” is the small number of hidden units specified by default in MLPClassifier: only two. Effectively, the data has to be compressed down to two dimensions in a meaningful way before being classified by a set of linear classifiers (the output layer), which appears to be an extremely difficult task here. If you run MLPClassifier in debug mode and start WEKA with a console open, you can monitor the optimisation process. After a long time, it actually terminates, after 10,000 iterations, without success.

The solution in this case is to use *more* hidden units. With 10 hidden units, MLPClassifier produced a solution after 250 iterations. I turned on conjugate gradient descent optimisation and used multi-threading to get these results (I set the pool size and the number of threads to five each). See below why conjugate gradient descent is necessary.
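
For reference, a hedged command-line sketch of these settings; the weka.jar path and ARFF file name are placeholders, not from the message.

```shell
# Hypothetical invocation -- adjust the weka.jar path and dataset name.
#   -N 10               ten hidden units instead of the default two
#   -G                  useCGD, i.e. conjugate gradient descent
#   -P 5 -E 5           pool size and number of threads, five each
#   -output-debug-info  print the progress of the optimisation process
java -cp weka.jar weka.classifiers.functions.MLPClassifier \
    -N 10 -G -P 5 -E 5 -output-debug-info -t train.arff
```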

 

Dl4j uses highly optimised native code rather than pure Java to do the heavy-duty mathematical work, so it can be much faster (especially if you use a GPU rather than a CPU). Another difference is that Dl4jMlpClassifier runs the optimisation process (which trains the network) for a user-specified number of iterations/epochs, whereas MLPClassifier runs iterations until the optimisation process converges (i.e., until the error on the training data cannot be improved any further), which can take a long time (and may not actually be necessary).

Also, MLPClassifier uses the BFGS optimisation algorithm by default, which is computationally inefficient for high-dimensional problems such as text classification. It takes a long time to converge and consumes a lot of memory. Turn on the conjugate gradient descent option in MLPClassifier (useCGD) for better runtime when the dimensionality of the data is high.
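
A back-of-the-envelope illustration of the memory difference, with my own numbers rather than anything from the message: full BFGS maintains a dense n-by-n approximation of the inverse Hessian, whereas conjugate gradient descent only keeps a few working vectors of length n.

```shell
# Rough memory footprint for n = 10,000 attributes (roughly the scale a
# StringToWordVector filter can produce), assuming 8-byte doubles.
n=10000
bfgs_bytes=$((n * n * 8))   # dense n x n inverse-Hessian approximation
cgd_bytes=$((4 * n * 8))    # a handful of length-n working vectors
echo "BFGS: $bfgs_bytes bytes"  # 800000000 bytes, roughly 0.75 GB
echo "CGD:  $cgd_bytes bytes"   # 320000 bytes, roughly 0.3 MB
```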

 

I can’t even run MLPClassifier on your data with default settings for the StringToWordVector filter (or Logistic, for that matter, which also uses BFGS optimisation): 12GB of heap space is insufficient and WEKA crashes because it runs out of memory.

Why don’t we always use conjugate gradient descent in MLPClassifier? Because BFGS can be much faster than conjugate gradient descent on low-dimensional problems.

This would actually be very useful info to have on the mailing list. Perhaps you could send a reply to the list with this info if it works for you?

Cheers,
Eibe

 

From: [hidden email]
Sent: Saturday, 9 February 2019 3:29 AM
To: [hidden email]; [hidden email]
Cc: [hidden email]
Subject: Re: [Wekalist] erratic behaviour hindering experimental replicability

Hi Eibe,

We are still experiencing trouble with the combination of FilteredClassifier: MLPClassifier + StringToWordVector filter.

My co-worker Benjamin and I have installed different versions of Weka on different computers, but MLPClassifier remains stuck on "Building model on training data" on most of them. It does not crash, but it seems to run in an infinite loop.

We are using the dataset that I attach to this email. I am sending it only to you and not to the list, because I am not sure about the licence requirements of the corpus the dataset was derived from. It is a public and free corpus, but it has a licence.

We have decided not to use the standard MLPClassifier in our experiments. Instead, we will use the MLP classifier in the Weka deep learning package.

I hope that the attached dataset will help the Weka team to fix the problem with MLPClassifier.

Have a nice weekend,

Marina and Benjamin

On Wed, 6 Feb 2019 at 09:02, Marina Santini <[hidden email]> wrote:
[...]

Re: erratic behaviour hindering experimental replicability

Marina Santini
Hi Eibe, 

I ran MLPClassifier together with the StringToWordVector filter via FilteredClassifier with the settings you suggested. The actual scheme is pasted below:

weka.classifiers.meta.FilteredClassifier -F "weka.filters.unsupervised.attribute.StringToWordVector -R first-last -W 1000 -prune-rate -1.0 -N 0 -stemmer weka.core.stemmers.NullStemmer -stopwords-handler weka.core.stopwords.Null -M 1 -tokenizer \"weka.core.tokenizers.WordTokenizer -delimiters \\\" \\\\r\\\\n\\\\t.,;:\\\\\\\'\\\\\\\"()?!\\\"\"" -S 1 -W weka.classifiers.functions.MLPClassifier -- -N 10 -R 0.01 -O 1.0E-6 -G -P 5 -E 5 -S 1 -output-debug-info -L weka.classifiers.functions.loss.SquaredError -A weka.classifiers.functions.activation.ApproximateSigmoid

It took about 20 hours to complete, but in the end the performance was respectable. I attach the whole output.

This is the scheme I used for Dl4jMlpClassifier. It took about two minutes to complete.
weka.classifiers.meta.FilteredClassifier -F "weka.filters.unsupervised.attribute.StringToWordVector -R first-last -W 1000 -prune-rate -1.0 -N 0 -stemmer weka.core.stemmers.NullStemmer -stopwords-handler weka.core.stopwords.Null -M 1 -tokenizer \"weka.core.tokenizers.WordTokenizer -delimiters \\\" \\\\r\\\\n\\\\t.,;:\\\\\\\'\\\\\\\"()?!\\\"\"" -S 1 -W weka.classifiers.functions.Dl4jMlpClassifier -- -S 1 -cache-mode MEMORY -early-stopping "weka.dl4j.earlystopping.EarlyStopping -maxEpochsNoImprovement 0 -valPercentage 0.0" -normalization "Standardize training data" -iterator "weka.dl4j.iterators.instance.DefaultInstanceIterator -bs 1" -iteration-listener "weka.dl4j.listener.EpochListener -eval true -n 5" -layer "weka.dl4j.layers.OutputLayer -lossFn \"weka.dl4j.lossfunctions.LossMCXENT \" -nOut 0 -activation \"weka.dl4j.activations.ActivationSoftmax \" -name \"Output layer\"" -logConfig "weka.core.LogConfiguration -append true -dl4jLogLevel WARN -logFile /Users/marinasantini/wekafiles/wekaDeeplearning4j.log -nd4jLogLevel INFO -wekaDl4jLogLevel INFO" -config "weka.dl4j.NeuralNetConfiguration -biasInit 0.0 -biasUpdater \"weka.dl4j.updater.Sgd -lr 0.001 -lrSchedule \\\"weka.dl4j.schedules.ConstantSchedule -scheduleType EPOCH\\\"\" -dist \"weka.dl4j.distribution.Disabled \" -dropout \"weka.dl4j.dropout.Disabled \" -gradientNormalization None -gradNormThreshold 1.0 -l1 NaN -l2 NaN -minimize -algorithm STOCHASTIC_GRADIENT_DESCENT -updater \"weka.dl4j.updater.Adam -beta1MeanDecay 0.9 -beta2VarDecay 0.999 -epsilon 1.0E-8 -lr 0.001 -lrSchedule \\\"weka.dl4j.schedules.ConstantSchedule -scheduleType EPOCH\\\"\" -weightInit XAVIER -weightNoise \"weka.dl4j.weightnoise.Disabled \"" -numEpochs 10 -queueSize 0 -zooModel "weka.dl4j.zoo.CustomNet "

Here is a summary table:

FilteredClassifier +
StringToWordVector     Weighted Avg. F-measure   Run time          Parameter settings
---------------------  ------------------------  ----------------  ------------------
SMO                    0.732                     about 1 minute    standard
MLPClassifier          0.748                     about 20 hours    customized
Dl4jMlpClassifier      0.706                     about 2 minutes   standard



I ran these experiments on a Mac running Weka 3.8.2
(note that the Mac versions of Weka 3.8.3 and 3.9.3 crash).

MLPClassifier belongs to the multiLayerPerceptrons package, while Dl4jMlpClassifier belongs to the WekaDeeplearning4j package.

The MultilayerPerceptron classifier that belongs to the basic Weka installation does not work via FilteredClassifier (possibly for the same reasons underlying MLPClassifier's behaviour with standard settings).

What is the difference between the classifiers shown in blue and those in black? (I know that greyed-out classifiers cannot be run on the loaded dataset.)

image.png

Cheers, Marina


On Mon, 11 Feb 2019 at 09:58, Marina Santini <[hidden email]> wrote:
[...]

Attachment: output_FilteredClassifier_MLPlcassifier_StringToWordVector_9TextVarieites.txt (1M)