WekaDeeplearning4j for text classification in Swedish

Marina Santini
Hi,

I would like to train my own word embeddings using the filters
provided by the WekaDeeplearning4j package (GUI installation). The aim
is text classification of Swedish documents.

The ARFF file we have prepared has 9 classes; the documents have been converted to string format, stripped of punctuation, and normalized to lowercase. See the example below:

@relation 'Bag-Of-Words_9TextCategories'
@attribute text string
@attribute TextCategories {reportage, editorial, review, hobby,
popular_lore, bio_essay, miscellaneous, scientific_writing,
imaginative_prose}

@data
"för syriens del handlar det om att sovjet på sistone givit tydliga
besked om att också vapenkrediter måste regleras den avgörande
skillnaden mellan konflikter i exempelvis europa [text has been
truncated]",reportage

The documentation
(https://deeplearning.cms.waikato.ac.nz/user-guide/nlp/) says "text
can be interpreted as a sequence of so called tokens, where a token
can be e.g. a character, word, sentence or even a whole document.
These tokens can further be mapped with the help of an embedding into
a vector space defined by the embedding. Therefore, a text document
can be represented as a sequence of vectors. This can be achieved by
using the Cnn/RnnTextEmbeddingInstanceIterator and providing an
embedding that was previously downloaded (e.g. Google's pretrained
News model from here)".

If I understand correctly, for Swedish I have to use pretrained word
embeddings from the Polyglot project
(https://sites.google.com/site/rmyeid/projects/polyglot?authuser=0).

If all this is correct, could you please give me some insights on how to optimize the parameters of the two deep learning filters? I would like to compare the performance of word embeddings with the performance of StringToWordVector (with SMO and MLP), but I have not used deep learning with Weka before.

Thanks in advance for your help

Cheers, Marina

Re: WekaDeeplearning4j for text classification in Swedish

steven-lang
Hi Marina,

The Polyglot embeddings are saved as Python pickle files, so I just wrote a
script to convert these *.pkl files into *.csv files which can be used by
our Rnn/CnnTextEmbeddingInstanceIterator. I have just updated the NLP
documentation with conversion instructions
(https://deeplearning.cms.waikato.ac.nz/user-guide/nlp/#embeddings) which
should be online in the next 24 hours.

Until then, you can also use the Python script in the attachment. You need
Python 3.3 or higher with NumPy and pandas installed.
 
polyglot_pkl_to_csv.py
<http://weka.8497.n7.nabble.com/file/t6770/polyglot_pkl_to_csv.py>  
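For reference, a minimal sketch of what such a conversion can look like (this is not necessarily the attached script; it assumes the Polyglot pickle holds a (words, vectors) pair, and the exact CSV layout the iterator expects is described in the documentation linked above):

import pickle

import pandas as pd

def polyglot_pkl_to_csv(pkl_path, csv_path):
    # Polyglot distributes each embedding as a pickled (words, vectors) pair,
    # where vectors is a NumPy array with one row per word.
    with open(pkl_path, "rb") as f:
        # The encoding argument lets Python 3 read the Python-2 pickle.
        words, vectors = pickle.load(f, encoding="latin1")
    # Write one row per word: the token followed by its vector components.
    pd.DataFrame(vectors, index=list(words)).to_csv(csv_path, header=False)

polyglot_pkl_to_csv("polyglot-sv.pkl", "polyglot-sv.csv")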

In case you don't have access to a Python environment right now, I have
converted the Swedish embedding for you here:
https://drive.google.com/open?id=1PtIsUHy0fNVpzVqSYDvDtlwB4KMBxROt

The two filters (Dl4jStringToWord2Vec and Dl4jStringToGlove) are for the
case in which you want to learn a new embedding from a given dataset
yourself. But since you specifically mentioned the Polyglot embeddings I
assume that is not what you want to do? Correct me if I'm wrong.

Cheers,
Steven




Re: WekaDeeplearning4j for text classification in Swedish

Marina Santini
Thanks a lot for your invaluable help, Steven.

Honestly, I would like to test all possible alternatives.

I am currently running a FilteredClassifier (namely with SMO, which is usually much faster than any deep learning classifier) combined with the Dl4jStringToGlove filter (see below), which means that the word embeddings are learned from the current dataset. In theory that would be nice, but it seems to take ages (at least with the default parameters). So if Polyglot's pretrained embeddings show competitive performance and a much faster running time, the choice will be easy :-)

[screenshot: FilteredClassifier setup with SMO and Dl4jStringToGlove]

I will get back with some results.

Thanks a lot again

Bye for now

Marina

Re: WekaDeeplearning4j for text classification in Swedish

Marina Santini
Hi again, 

As I mentioned in my previous email, I was about to run a FilteredClassifier using SMO and the Dl4jStringToGlove filter. Unfortunately, I never got results. The combination has been running for 24 hours, never getting further than the status you see in the picture that I copy below (the Weka bird happily scampering in the bottom right-hand corner of the screen).

My guess is that the process is trapped in a loop somewhere. How can I troubleshoot this problem?

The dataset that I am using is relatively small: less than 1 million words unevenly distributed over 1040 records.

Just to have an idea, what would be the average time to process a dataset of this size using a FilteredClassifier that combines SMO and the Dl4jStringToGlove filter?

Thanks in advance for your help. 

Cheers, Marina
[screenshot: Weka Explorer status bar, run still in progress]


Re: WekaDeeplearning4j for text classification in Swedish

steven-lang
Hi Marina,

I had a more thorough look into this issue and was not able to fit a GloVe model either.
It turns out that this is an issue with the backend library Deeplearning4j and its GloVe model implementation (a deadlock during model training, as the task is pushed onto a separate thread).
The good news is that the problem was already known and was fixed in Deeplearning4j 16 days ago (reference commit: https://github.com/deeplearning4j/deeplearning4j/commit/94c811e9a6e28e15ce17cbc467161028d7f822b1).
This means the Dl4jStringToGlove filter will work again as soon as Deeplearning4j publishes its next version.

I'm sorry for the inconvenience. 

The Dl4jStringToWord2Vec filter is not affected by this issue and works as expected.

Cheers,
Steven


Re: WekaDeeplearning4j for text classification in Swedish

steven-lang
Hi Marina,

I think there is a misunderstanding about what the Dl4jStringToWord2Vec filter (and the Dl4jStringToGlove filter) does. We should be clearer in our documentation!

The Dl4jStringToWord2Vec filter is used to generate a word embedding from a given dataset. That is, the output of the filter is not the transformed input dataset but the word embedding that has been learned from the input, which you can save and use later on (see the explanation below).

> To make a long story short: can I use word embeddings to get a classification output similar to the one below, where I used a filtered classifier that combines DI4jMlpClassifier and the stringToWordVector filter?

Yes, you can. But keep in mind that using the StringToWordVector filter to transform each input instance is conceptually different from using an embedding that is learned with Dl4jStringToWord2Vec/Glove:

- The StringToWordVector filter transforms each input instance into a single vector of length N (the size of your word dictionary)
- Using an embedding with the Cnn/RnnTextEmbeddingInstanceIterator (see below) will transform each input instance into a matrix of size K x M, where K is your maximum sequence length and M is your embedding size. Each token in an input sentence is mapped via the word embedding to a vector, and since a sentence contains multiple tokens, each sentence is then represented as a matrix (see the toy sketch below)
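A toy sketch of the difference (hypothetical three-word vocabulary and random 4-dimensional vectors, purely for illustration; real embeddings typically have 64-300 dimensions):

import numpy as np

vocab = ["sovjet", "europa", "vapenkrediter"]
embedding = {w: np.random.rand(4) for w in vocab}  # toy 4-dim embedding

sentence = ["sovjet", "europa"]

# StringToWordVector-style: one vector of length N = |vocab| per document.
bow = np.array([1.0 if w in sentence else 0.0 for w in vocab])  # shape (3,)

# Embedding-iterator-style: one vector per token, i.e. a K x M matrix
# with K = sequence length and M = embedding size.
matrix = np.stack([embedding[w] for w in sentence])  # shape (2, 4)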

So if you want to do text classification with embeddings in WekaDeeplearning4j, you take the following steps (a rough sketch of the resulting configuration follows the list):

- Load your dataset as usual in the "Preprocess" pane
- Select a classifier provided by WekaDeeplearning4j. You've got two options here: the RnnSequenceClassifier or the Dl4jMlpClassifier
- Select the appropriate instance iterator in the classifier:
  - For the RnnSequenceClassifier, select RnnTextEmbeddingInstanceIterator
  - For the Dl4jMlpClassifier, select CnnTextEmbeddingInstanceIterator
- In the iterator options, set "location of word vectors" to your serialized embedding file (either an ARFF file that you have created previously via Dl4jStringToWord2Vec or, e.g., the polyglot-sv.csv embedding)
- Set up the appropriate network architecture and network configuration
- Run the classifier
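Put together, the setup corresponds to a scheme string roughly along these lines (a sketch only: the iterator's package path and the name of the word-vector option are my assumptions and may differ in your version; the GUI fills in the exact names when you configure it, and -columns 64 assumes the 64-dimensional Polyglot vectors):

weka.classifiers.functions.Dl4jMlpClassifier -iterator "weka.dl4j.iterators.instance.CnnTextEmbeddingInstanceIterator -wordVectorLocation /path/to/polyglot-sv.csv -bs 32" -layer "weka.dl4j.layers.ConvolutionLayer -mode Same -rows 3 -columns 64 -nFilters 32" -layer "weka.dl4j.layers.GlobalPoolingLayer" -layer "weka.dl4j.layers.OutputLayer" -numEpochs 10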

If you have any questions regarding this process feel free to ask.

Cheers,
Steven

On Wed, Feb 6, 2019 at 1:43 PM Marina Santini <[hidden email]> wrote:
Great! thanks for the quick reply, Steven. 

I confirm that the Dl4jStringToWord2Vec filter works fine. However, I assume that it cannot be used with the FilteredClassifier, since I get this error:

[screenshot: FilteredClassifier error message]

An old thread (https://wekalist.scms.waikato.ac.narkive.com/QsW3ezSP/class-index-not-set-when-using-filteredclassifier-with-attributeselection-filter-and) says: "The AttributeSelectionFilter does not set a class attribute again if the attribute selection algorithm is an unsupervised one, since those might remove the class attribute as well. So, for algorithms derived from the following classes it will never work in the FilteredClassifier, due to the missing class attribute:
weka.attributeSelection.UnsupervisedSubsetEvaluator
weka.attributeSelection.UnsupervisedAttributeEvaluator"

I do not know whether this is also the case for Dl4jStringToWord2Vec. I know that this is NOT the case for the unsupervised StringToWordVector filter, since I use it in the FilteredClassifier regularly.

I tried to run Dl4jStringToWord2Vec in the Preprocess tab and saved the transformed dataset (attached here).
When I run any classifier on this dataset (including SMO and Dl4jMlpClassifier) I get the following error:

[screenshot: classifier error message]


I am following the instructions for the GUI in https://deeplearning.cms.waikato.ac.nz/ but I think I am missing something. But what?

To make a long story short: can I use word embeddings to get a classification output similar to the one below, where I used a FilteredClassifier that combines Dl4jMlpClassifier and the StringToWordVector filter?


[screenshot: classification output with Dl4jMlpClassifier and StringToWordVector]


I will be glad to answer any additional questions.

Thanks in advance. 

Cheers, Marina

Re: WekaDeeplearning4j for text classification in Swedish

Marina Santini
Thanks Steven for the clarification and the detailed explanation. 

However, I do not see how to implement these two options in the GUI:
- Use the Dl4jMlpClassifier as a CNN
- Use the RnnSequenceClassifier

In my GUI I do not see any field that lets me specify the two options above.

This is what I did: I loaded the string dataset.
I selected Dl4jMlpClassifier, and below you can see the window that I have on the screen. I could specify the CnnTextEmbeddingInstanceIterator, as you can see, and also the embedding created with the Dl4jStringToWord2Vec filter (an ARFF file).

[screenshot: Dl4jMlpClassifier configuration]

[screenshot: CnnTextEmbeddingInstanceIterator settings]

This is certainly not enough, since I get the following error:

[screenshot: error message]

Do you think that my WekaDeeplearning4j package is incomplete? I installed it via the GUI package manager.

Thanks in advance for your help.

Cheers, Marina

Re: WekaDeeplearning4j for text classification in Swedish

steven-lang
Hi Marina,

I think your setup is almost complete.
The error you see here just states that your ConvolutionLayer requires the convolution mode "Same":

[screenshot: error message about the required convolution mode]

Also keep in mind:
- The convolution layers have to be followed by a single GlobalPoolingLayer
- Set the number of kernel columns to your embedding size (see the sketch below)
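Putting the two hints together, the layer part of the configuration might look roughly like this (a sketch; the option names are taken from the scheme strings elsewhere in this thread, -columns 100 assumes a 100-dimensional embedding, and the GlobalPoolingLayer is shown with its defaults):

-layer "weka.dl4j.layers.ConvolutionLayer -mode Same -rows 3 -columns 100 -nFilters 32" -layer "weka.dl4j.layers.GlobalPoolingLayer" -layer "weka.dl4j.layers.OutputLayer"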

This is also explained in our documentation.

Cheers,
Steven 




Re: WekaDeeplearning4j for text classification in Swedish

Felipe Bravo
Hi,
There is a way to use the Glove or Word2Vec filters for text classification. If you set the action parameter to DOC_VECTOR_AVERAGE, the filter will calculate word embeddings from the corpus of documents and then represent each document as the average vector of its words. Since the filter doesn't remove the string attribute with the document content, you should also include the RemoveType filter in a MultiFilter pipeline if you want to use the filter together with the FilteredClassifier.
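For intuition, what DOC_VECTOR_AVERAGE computes per document is essentially the following (a toy sketch, not the filter's actual code; the doc_vector helper is hypothetical):

import numpy as np

def doc_vector(tokens, embedding, dim=100):
    # Average the embedding vectors of the tokens we have vectors for;
    # fall back to a zero vector if no token is in the embedding.
    vecs = [embedding[t] for t in tokens if t in embedding]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)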

An example using Word2Vec is given below:

weka.classifiers.meta.FilteredClassifier -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.Dl4jStringToWord2Vec -allowParallelTokenization -batchSize 512 -learningRate 0.025 -minLearningRate 1.0E-4 -negative 0.0 -sampling 0.0 -useHierarchicSoftmax -action DOC_VECTOR_AVERAGE -concat_words 15 -embedding_prefix embedding- -epochs 4 -iterations 1 -layerSize 100 -minWordFrequency 5 -preprocessor \\\"weka.dl4j.text.tokenization.preprocessor.CommonPreProcessor \\\" -seed 1 -stopWordsHandler \\\"weka.dl4j.text.stopwords.Dl4jNull \\\" -index 1 -tokenizerFactory \\\"weka.dl4j.text.tokenization.tokenizer.factory.DefaultTokenizerFactory \\\" -windowSize 5 -workers 6\" -F \"weka.filters.unsupervised.attribute.RemoveType -T string\"" -S 1 -W weka.classifiers.functions.Dl4jMlpClassifier -- -S 1 -cache-mode MEMORY -early-stopping "weka.dl4j.earlystopping.EarlyStopping -maxEpochsNoImprovement 0 -valPercentage 0.0" -normalization "Standardize training data" -iterator "weka.dl4j.iterators.instance.DefaultInstanceIterator -bs 1" -iteration-listener "weka.dl4j.listener.EpochListener -eval true -n 5" -layer "weka.dl4j.layers.OutputLayer -lossFn \"weka.dl4j.lossfunctions.LossMCXENT \" -nOut 0 -activation \"weka.dl4j.activations.ActivationSoftmax \" -name \"Output layer\"" -logConfig "weka.core.LogConfiguration -append true -dl4jLogLevel WARN -logFile /home/fbravoma/wekafiles/wekaDeeplearning4j.log -nd4jLogLevel INFO -wekaDl4jLogLevel INFO" -config "weka.dl4j.NeuralNetConfiguration -biasInit 0.0 -biasUpdater \"weka.dl4j.updater.Sgd -lr 0.001 -lrSchedule \\\"weka.dl4j.schedules.ConstantSchedule -scheduleType EPOCH\\\"\" -dist \"weka.dl4j.distribution.Disabled \" -dropout \"weka.dl4j.dropout.Disabled \" -gradientNormalization None -gradNormThreshold 1.0 -l1 NaN -l2 NaN -minimize -algorithm STOCHASTIC_GRADIENT_DESCENT -updater \"weka.dl4j.updater.Adam -beta1MeanDecay 0.9 -beta2VarDecay 0.999 -epsilon 1.0E-8 -lr 0.001 -lrSchedule \\\"weka.dl4j.schedules.ConstantSchedule -scheduleType EPOCH\\\"\" -weightInit XAVIER -weightNoise \"weka.dl4j.weightnoise.Disabled \"" -numEpochs 10 -queueSize 0 -zooModel "weka.dl4j.zoo.CustomNet "


You need to have a large corpus to get good results with this approach, though.
Cheers,
Felipe


Re: WekaDeeplearning4j for text classification in Swedish

Marina Santini
Thanks a lot Felipe for sharing your experience. 

I tried to apply your suggestion. Here is my scheme: 
weka.classifiers.meta.FilteredClassifier -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.Dl4jStringToWord2Vec -allowParallelTokenization -batchSize 512 -learningRate 0.025 -minLearningRate 1.0E-4 -negative 0.0 -sampling 0.0 -useHierarchicSoftmax -action DOC_VECTOR_AVERAGE -concat_words 15 -embedding_prefix embedding- -epochs 1 -iterations 1 -layerSize 100 -minWordFrequency 5 -preprocessor \\\"weka.dl4j.text.tokenization.preprocessor.CommonPreProcessor \\\" -seed 1 -stopWordsHandler \\\"weka.dl4j.text.stopwords.Dl4jNull \\\" -index 1 -tokenizerFactory \\\"weka.dl4j.text.tokenization.tokenizer.factory.DefaultTokenizerFactory \\\" -windowSize 5 -workers 4\" -F \"weka.filters.unsupervised.attribute.RemoveType -T string\"" -W weka.classifiers.functions.Dl4jMlpClassifier -- -S 1 -cache-mode MEMORY -early-stopping "weka.dl4j.earlystopping.EarlyStopping -maxEpochsNoImprovement 0 -valPercentage 0.0" -normalization "Standardize training data" -iterator "weka.dl4j.iterators.instance.DefaultInstanceIterator -bs 1" -iteration-listener "weka.dl4j.listener.EpochListener -eval true -n 5" -layer "weka.dl4j.layers.ConvolutionLayer -nFilters 32 -mode Same -cudnnAlgoMode PREFER_FASTEST -rows 3 -columns 3 -paddingColumns 0 -paddingRows 0 -strideColumns 1 -strideRows 1 -nOut 32 -activation \"weka.dl4j.activations.ActivationIdentity \" -name \"Convolution layer\"" -layer "weka.dl4j.layers.OutputLayer -lossFn \"weka.dl4j.lossfunctions.LossMCXENT \" -nOut 0 -activation \"weka.dl4j.activations.ActivationSoftmax \" -name \"Output layer\"" -logConfig "weka.core.LogConfiguration -append true -dl4jLogLevel WARN -logFile C:\\Users\\MarinaSantini\\wekafiles\\wekaDeeplearning4j.log -nd4jLogLevel INFO -wekaDl4jLogLevel INFO" -config "weka.dl4j.NeuralNetConfiguration -biasInit 0.0 -biasUpdater \"weka.dl4j.updater.Sgd -lr 0.001 -lrSchedule \\\"weka.dl4j.schedules.ConstantSchedule -scheduleType EPOCH\\\"\" -dist \"weka.dl4j.distribution.Disabled \" -dropout \"weka.dl4j.dropout.Disabled \" -gradientNormalization None -gradNormThreshold 1.0 -l1 NaN -l2 NaN -minimize -algorithm STOCHASTIC_GRADIENT_DESCENT -updater \"weka.dl4j.updater.Adam -beta1MeanDecay 0.9 -beta2VarDecay 0.999 -epsilon 1.0E-8 -lr 0.001 -lrSchedule \\\"weka.dl4j.schedules.ConstantSchedule -scheduleType EPOCH\\\"\" -weightInit XAVIER -weightNoise \"weka.dl4j.weightnoise.Disabled \"" -numEpochs 10 -queueSize 0 -zooModel "weka.dl4j.zoo.CustomNet "

Unfortunately I get this error: 

[screenshot: error message]

Any suggestions?

Cheers, Marina

On Wed, 6 Feb 2019 at 22:05, Felipe Bravo <[hidden email]> wrote:
Hi,
There is a way to use the Glove or Word2Vec filters for text classification. If you set the action parameter to DOC_VECTOR_AVERAGE, the filter will learn word embeddings from the corpus of documents and then represent each document as the average vector of its words. Since the filter doesn't remove the String attribute with the document content, you should also include the RemoveType filter in a MultiFilter pipeline if you want to use the filter together with the FilteredClassifier.

An example using Word2Vec is given below:

weka.classifiers.meta.FilteredClassifier -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.Dl4jStringToWord2Vec -allowParallelTokenization -batchSize 512 -learningRate 0.025 -minLearningRate 1.0E-4 -negative 0.0 -sampling 0.0 -useHierarchicSoftmax -action DOC_VECTOR_AVERAGE -concat_words 15 -embedding_prefix embedding- -epochs 4 -iterations 1 -layerSize 100 -minWordFrequency 5 -preprocessor \\\"weka.dl4j.text.tokenization.preprocessor.CommonPreProcessor \\\" -seed 1 -stopWordsHandler \\\"weka.dl4j.text.stopwords.Dl4jNull \\\" -index 1 -tokenizerFactory \\\"weka.dl4j.text.tokenization.tokenizer.factory.DefaultTokenizerFactory \\\" -windowSize 5 -workers 6\" -F \"weka.filters.unsupervised.attribute.RemoveType -T string\"" -S 1 -W weka.classifiers.functions.Dl4jMlpClassifier -- -S 1 -cache-mode MEMORY -early-stopping "weka.dl4j.earlystopping.EarlyStopping -maxEpochsNoImprovement 0 -valPercentage 0.0" -normalization "Standardize training data" -iterator "weka.dl4j.iterators.instance.DefaultInstanceIterator -bs 1" -iteration-listener "weka.dl4j.listener.EpochListener -eval true -n 5" -layer "weka.dl4j.layers.OutputLayer -lossFn \"weka.dl4j.lossfunctions.LossMCXENT \" -nOut 0 -activation \"weka.dl4j.activations.ActivationSoftmax \" -name \"Output layer\"" -logConfig "weka.core.LogConfiguration -append true -dl4jLogLevel WARN -logFile /home/fbravoma/wekafiles/wekaDeeplearning4j.log -nd4jLogLevel INFO -wekaDl4jLogLevel INFO" -config "weka.dl4j.NeuralNetConfiguration -biasInit 0.0 -biasUpdater \"weka.dl4j.updater.Sgd -lr 0.001 -lrSchedule \\\"weka.dl4j.schedules.ConstantSchedule -scheduleType EPOCH\\\"\" -dist \"weka.dl4j.distribution.Disabled \" -dropout \"weka.dl4j.dropout.Disabled \" -gradientNormalization None -gradNormThreshold 1.0 -l1 NaN -l2 NaN -minimize -algorithm STOCHASTIC_GRADIENT_DESCENT -updater \"weka.dl4j.updater.Adam -beta1MeanDecay 0.9 -beta2VarDecay 0.999 -epsilon 1.0E-8 -lr 0.001 -lrSchedule \\\"weka.dl4j.schedules.ConstantSchedule -scheduleType EPOCH\\\"\" -weightInit XAVIER -weightNoise \"weka.dl4j.weightnoise.Disabled \"" -numEpochs 10 -queueSize 0 -zooModel "weka.dl4j.zoo.CustomNet "
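
If you prefer to drive this from code rather than pasting the scheme, here is a minimal Java sketch of the same pipeline (the dataset path is a placeholder, and the filter options are taken from the scheme above):

import weka.classifiers.functions.SMO;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.MultiFilter;
import weka.filters.unsupervised.attribute.RemoveType;

public class DocVectorAverageExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("swedish_docs.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // Learn word embeddings from the corpus and average them per document.
        Filter w2v = (Filter) Utils.forName(Filter.class,
                "weka.filters.unsupervised.attribute.Dl4jStringToWord2Vec",
                Utils.splitOptions("-action DOC_VECTOR_AVERAGE -index 1 "
                        + "-layerSize 100 -epochs 4 -windowSize 5 -minWordFrequency 5"));

        // Drop the original string attribute afterwards.
        RemoveType removeString = new RemoveType();
        removeString.setOptions(Utils.splitOptions("-T string"));

        MultiFilter pipeline = new MultiFilter();
        pipeline.setFilters(new Filter[] {w2v, removeString});

        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(pipeline);
        fc.setClassifier(new SMO()); // any base classifier works here
        fc.buildClassifier(data);
    }
}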


You need to have a large corpus to get good results with this approach though.
Cheers,
Felipe

On Thu, Feb 7, 2019 at 7:31 AM Steven Lang <[hidden email]> wrote:
Hi Marina,

I think your setup is almost complete.
The error you see here just states that your ConvolutionLayer requires the convolution mode "Same":

[screenshot: error message]

Also keep in mind:
- The convolution layers have to be followed by a single GlobalPoolingLayer
- Set the number of kernel columns to your embedding size 

This is also explained in our documentation.

Cheers,
Steven 



On Wed, Feb 6, 2019 at 5:46 PM Marina Santini <[hidden email]> wrote:
Thanks Steven for the clarification and the detailed explanation. 

However, I do not see how to implement these 2 steps in the GUI: 
- Use the Dl4jMlpClassifier as a CNN
- Use the RnnSequenceClassifier

In my GUI I do not see any field that lets me specify the 2 steps above.

This is what I did: I uploaded the string dataset. 
I selected Dl4jMlpClassifier, and below you can see the window that I have on the screen. I could specify the CnnTextEmbeddingInstanceIterator, as you can see, and also the embeddings created with the Dl4jStringToWord2Vec filter (an ARFF file).

[screenshots: classifier and iterator configuration windows]

This is certainly not enough since I get the following error

[screenshot: error message]

Do you think that my WekaDeeplearning4j package is incomplete? I installed it via the GUI package manager.

Thanks in advance for your help.

Cheers, Marina




On Wed, 6 Feb 2019 at 15:25, Steven Lang <[hidden email]> wrote:
Hi Marina,

I think there is a misunderstanding about what the Dl4jStringToWord2Vec filter (and the Dl4jStringToGlove filter) does. We should be clearer in our documentation!

The Dl4jStringToWord2Vec filter is used to generate a word embedding from a given dataset. That is, the output of the filter is not the transformed input dataset but the word embedding that has been learned from the input - which you can save and use later on (see explanation below).
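
For instance, a minimal Java sketch of this (file names are placeholders; the filter options mirror the schemes in this thread) runs the filter once and saves its output for later use:

import java.io.File;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ArffSaver;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;

public class LearnEmbedding {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("swedish_docs.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // The filter's output is the learned word embedding,
        // not the transformed input dataset.
        Filter w2v = (Filter) Utils.forName(Filter.class,
                "weka.filters.unsupervised.attribute.Dl4jStringToWord2Vec",
                Utils.splitOptions("-index 1 -layerSize 100 -epochs 4 "
                        + "-windowSize 5 -minWordFrequency 5"));
        w2v.setInputFormat(data);
        Instances embedding = Filter.useFilter(data, w2v);

        // Save the embedding; point the text embedding
        // instance iterator at this file later.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(embedding);
        saver.setFile(new File("word2vec_embedding.arff"));
        saver.writeBatch();
    }
}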

> To make a long story short: can I use word embeddings to get a classification output similar to the one below, where I used a filtered classifier that combines Dl4jMlpClassifier and the StringToWordVector filter?

Yes, you can. But keep in mind that using the StringToWordVector filter to transform each input instance is conceptually different from using an embedding that is learned with Dl4jStringToWord2Vec/Glove:

- The StringToWordVector filter transforms each input instance into a vector of size N (the size of your word dictionary)
- Using an embedding with the Cnn/RnnTextEmbeddingInstanceIterator (see below) will transform each input instance into a matrix of size K x M, where K is your maximum sequence length and M is your embedding size. Each token in an input sentence is transformed via the word embedding into a vector, and since a sentence contains multiple tokens, each sentence is then represented as a matrix

So if you want to do text classification with embeddings in WekaDeeplearning4j, you take the following steps (a rough Java sketch follows the list):

- Load your dataset as usual in the "Preprocess" pane
- Select a classifier provided by WekaDeeplearning4j. You've got two options here: the Dl4jMlpClassifier (as a CNN) or the RnnSequenceClassifier
- Select the appropriate instance iterator in the classifier:
  - For the RnnSequenceClassifier select RnnTextEmbeddingInstanceIterator
  - For the Dl4jMlpClassifier select CnnTextEmbeddingInstanceIterator
- In the iterator options set "location of word vectors" to your serialized embedding file (either an ARFF file that you have created previously via Dl4jStringToWord2Vec or e.g. the polyglot-sv.csv embedding)
- Set up the appropriate network architecture and the network configuration
- Run the classifier
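
As a rough Java sketch of these steps (the embedding path and dataset name are placeholders; the iterator and layer option strings follow the schemes in this thread, with the kernel columns set to 64 to match the 64-dimensional Polyglot embedding):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.Dl4jMlpClassifier;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class CnnTextClassification {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("swedish_docs.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // CNN text setup: embedding iterator, convolution layer(s),
        // a single GlobalPoolingLayer, and an output layer.
        Dl4jMlpClassifier clf = new Dl4jMlpClassifier();
        clf.setOptions(Utils.splitOptions(
                "-iterator \"weka.dl4j.iterators.instance.sequence.text.cnn.CnnTextEmbeddingInstanceIterator"
                        + " -wordVectorLocation /path/to/polyglot-sv.csv -truncationLength 100 -bs 1\""
                + " -layer \"weka.dl4j.layers.ConvolutionLayer -mode Same -rows 3 -columns 64 -nFilters 32\""
                + " -layer \"weka.dl4j.layers.GlobalPoolingLayer -poolingType MAX\""
                + " -layer \"weka.dl4j.layers.OutputLayer\""
                + " -numEpochs 10 -S 1"));

        // Evaluate with 10-fold cross-validation (slow for a CNN,
        // but fine as a first sanity check).
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(clf, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}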

If you have any questions regarding this process feel free to ask.

Cheers,
Steven

On Wed, Feb 6, 2019 at 1:43 PM Marina Santini <[hidden email]> wrote:
Great! thanks for the quick reply, Steven. 

I confirm that the Dl4jStringToWord2Vec filter works fine. However, I assume that it cannot be used with the FilteredClassifier, since I get this error:

[screenshot: error message]

An old thread (https://wekalist.scms.waikato.ac.narkive.com/QsW3ezSP/class-index-not-set-when-using-filteredclassifier-with-attributeselection-filter-and) says that "The AttributeSelectionFilter does not set a class attribute again if the attribute selection algorithm is an unsupervised one, since those might remove the class attribute as well. So, for algorithms derived from the following classes it will never work in the FilteredClassifier, due to the missing class attribute:
weka.attributeSelection.UnsupervisedSubsetEvaluator
weka.attributeSelection.UnsupervisedAttributeEvaluator"

I do not know whether this is also the case for Dl4jStringToWord2Vec. I know that this is NOT the case for the unsupervised filter called StringToWordVector, since I use it in the FilteredClassifier regularly.

I tried to run Dl4jStringToWord2Vec in the Preprocess tab and saved the transformed dataset (attached here).
When I run any classifier on this dataset (including SMO and Dl4jMlpClassifier) I get the following error:

[screenshot: error message]


I am following the instructions for the GUI at https://deeplearning.cms.waikato.ac.nz/ but I think I am missing something. But what?

To make a long story short: can I use word embeddings to get a classification output similar to the one below, where I used a filtered classifier that combines Dl4jMlpClassifier and the StringToWordVector filter?


[screenshot: classification output]


I will be glad to answer any additional questions.

Thanks in advance. 

Cheers, Marina







On Wed, 6 Feb 2019 at 12:39, Steven Lang <[hidden email]> wrote:
Hi Marina,

I had a more thorough look into this issue and was not able to fit a Glove model either.
It turns out that this is an issue with the backend library Deeplearning4j and their Glove model implementation (a deadlock during model training, as they push the task onto a separate thread).
The good news is that the problem was known and was fixed in Deeplearning4j 16 days ago (reference commit: https://github.com/deeplearning4j/deeplearning4j/commit/94c811e9a6e28e15ce17cbc467161028d7f822b1).
This means the Dl4jStringToGlove filter will be working again as soon as Deeplearning4j publishes their next version.

I'm sorry for the inconvenience. 

The Dl4jStringToWord2Vec filter is not affected by this issue and works as expected.

Cheers,
Steven

On Wed, Feb 6, 2019 at 9:25 AM Marina Santini <[hidden email]> wrote:
Hi again, 

As I mentioned in my previous email, I was about to run a filtered classifier using SMO and the Dl4jStringToGlove filter. Unfortunately, I never got results. The combination has been running for 24 hours, never going further than the status you see in the picture below (the Weka bird happily scampering in the bottom right-hand corner of the screen).

My guess is that the process is trapped in a loop somewhere. How can I troubleshoot this problem?

The dataset that I am using is relatively small: less than 1 million words unevenly distributed over 1040 records.

Just to have an idea, what would be the average time to process a dataset of this size using a filtered classifier that combines SMO and the Dl4jStringToGlove filter?

Thanks in advance for your help. 

Cheers, Marina
[screenshot: Weka status bar]

On Tue, 5 Feb 2019 at 11:14, Marina Santini <[hidden email]> wrote:
Thanks a lot for your invaluable help, Steven.

Honestly, I would like to test all possible alternatives.

I am currently running a filtered classifier (namely SMO, which is usually much, much faster than any deep learning classifier) combined with the Dl4jStringToGlove filter (see below), which means that word embeddings are learned from the current dataset. Theoretically, that would be nice, but it seems to take ages (at least with standard parameters). So if Polyglot's pretrained embeddings show competitive performance and a much faster running time, the choice will be easy :-)

[screenshot: filter configuration]

I will get back with some results.

Thanks a lot again

Bye for now

Marina



_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
To subscribe, unsubscribe, etc., visit https://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html
Reply | Threaded
Open this post in threaded view
|

Re: WekaDeeplearning4j for text classification in Swedish

Marina Santini
In reply to this post by steven-lang
Hi Steven, 

and thanks again for your patience and kindness. 

I tried to apply what you said, and I ran the following scheme: 

 weka.classifiers.functions.Dl4jMlpClassifier -S 1 -cache-mode MEMORY -early-stopping "weka.dl4j.earlystopping.EarlyStopping -maxEpochsNoImprovement 0 -valPercentage 0.0" -normalization "Standardize training data" -iterator "weka.dl4j.iterators.instance.sequence.text.cnn.CnnTextEmbeddingInstanceIterator -stopWords \"weka.dl4j.text.stopwords.Dl4jRainbow \" -tokenPreProcessor \"weka.dl4j.text.tokenization.preprocessor.CommonPreProcessor \" -tokenizerFactory \"weka.dl4j.text.tokenization.tokenizer.factory.DefaultTokenizerFactory \" -truncationLength 100 -wordVectorLocation D:\\zz_myPubs\\2019_benjamin\\suc_datasets\\09Varieties_word2Vec_word_embeddings.arff -bs 1" -iteration-listener "weka.dl4j.listener.EpochListener -eval true -n 5" -layer "weka.dl4j.layers.ConvolutionLayer -nFilters 64 -mode Same -cudnnAlgoMode PREFER_FASTEST -rows 5 -columns 5 -paddingColumns 0 -paddingRows 0 -strideColumns 1 -strideRows 1 -nOut 64 -activation \"weka.dl4j.activations.ActivationIdentity \" -name \"Convolution layer 1\"" -layer "weka.dl4j.layers.ConvolutionLayer -nFilters 32 -mode Same -cudnnAlgoMode PREFER_FASTEST -rows 5 -columns 5 -paddingColumns 0 -paddingRows 0 -strideColumns 1 -strideRows 1 -nOut 32 -activation \"weka.dl4j.activations.ActivationIdentity \" -name \"Convolution layer 2\"" -layer "weka.dl4j.layers.GlobalPoolingLayer -collapseDimensions true -pnorm 2 -poolingType MAX -name \"GlobalPooling layer\"" -layer "weka.dl4j.layers.OutputLayer -lossFn \"weka.dl4j.lossfunctions.LossMCXENT \" -nOut 0 -activation \"weka.dl4j.activations.ActivationSoftmax \" -name \"Output layer\"" -logConfig "weka.core.LogConfiguration -append true -dl4jLogLevel WARN -logFile C:\\Users\\MarinaSantini\\wekafiles\\wekaDeeplearning4j.log -nd4jLogLevel INFO -wekaDl4jLogLevel INFO" -config "weka.dl4j.NeuralNetConfiguration -biasInit 0.0 -biasUpdater \"weka.dl4j.updater.Sgd -lr 0.001 -lrSchedule \\\"weka.dl4j.schedules.ConstantSchedule -scheduleType EPOCH\\\"\" -dist \"weka.dl4j.distribution.Disabled \" -dropout \"weka.dl4j.dropout.Disabled \" -gradientNormalization None -gradNormThreshold 1.0 -l1 NaN -l2 NaN -minimize -algorithm STOCHASTIC_GRADIENT_DESCENT -updater \"weka.dl4j.updater.Adam -beta1MeanDecay 0.9 -beta2VarDecay 0.999 -epsilon 1.0E-8 -lr 0.001 -lrSchedule \\\"weka.dl4j.schedules.ConstantSchedule -scheduleType EPOCH\\\"\" -weightInit XAVIER -weightNoise \"weka.dl4j.weightnoise.Disabled \"" -numEpochs 10 -queueSize 0 -zooModel "weka.dl4j.zoo.CustomNet "

I get this error: 
[screenshot: error message about kernel height]

I don't know what kernel height means. I know that, theoretically, the kernel size is the same as the size of the longest word in the vocabulary. In the scheme, I accepted the default value, i.e. 5 for both the kernel rows and columns.

Where shall I specify the kernel height, and what is the default value for it?

Big thanks
Cheers, Marina


_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
To subscribe, unsubscribe, etc., visit https://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html
Reply | Threaded
Open this post in threaded view
|

Re: WekaDeeplearning4j for text classification in Swedish

steven-lang
Hi Marina,

I tried to reproduce your setup on an arbitrary dataset (Reuters.arff) with the embedding you provided earlier and did not run into this issue.
Would you mind sending me a sample of your dataset, so I can run the exact same experiment?

Regarding the kernel size in a CNN sequence classification task:

- kernel height is equal to the number of tokens to include in one convolution operation (along the token axis)
- kernel width is equal to the size of the embedding subspace to include in one convolution operation (along the embedding axis)

That means a convolution layer with a kernel size of 5 x 5 would slide a window over 5 tokens (height) along 5 elements (width) of their embedding vectors. For example, with the 64-dimensional Polyglot embedding, a kernel of 3 x 64 (-rows 3 -columns 64) covers 3 consecutive tokens and the full embedding vector in each convolution step.

Also: Which version of WekaDeeplearning4j do you have installed?

Cheers,
Steven


_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
To subscribe, unsubscribe, etc., visit https://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html