Issues related to Word2Vec


Issues related to Word2Vec

valerio jus
Hi all,

I have a couple of questions:

1- When applying Word2Vec, does the StringToWordVector filter need to be applied first?

2- Can Word2Vec automatically tokenize sentences into words, so that StringToWordVector is not needed?

Any help would be greatly appreciated.

Thank you.
Valerio

Re: Issues related to Word2Vec

Felipe Bravo
Hi,
Which implementation of word2vec are you using?
Cheers,
Felipe


Re: Issues related to Word2Vec

valerio jus
Thanks Felipe Bravo,

> Which implementation of word2vec are you using?

I'm using it for text mining.

Valerio


Re: Issues related to Word2Vec

Felipe Bravo
If you want to train Word2Vec embeddings from within Weka, you need to install the WekaDeepLearning4j package (https://deeplearning.cms.waikato.ac.nz/) and use the Dl4jStringToWord2Vec filter. This filter can operate in two main ways.
If the action option is set to WORD_VECTOR, the filter transforms your textual dataset into a word-embeddings matrix (one word per row, with the embedding values as attributes). Alternatively, you can set action to DOC_VECTOR_AVERAGE, and each document will be represented as the average vector of its words.
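
For example, here is a minimal sketch of driving the filter from the Weka Java API (the dataset path is a placeholder, and any options not shown are left at their defaults):

import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Dl4jStringToWord2Vec;

public class Word2VecFilterSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder path: any dataset with a String attribute holding the raw text.
    Instances data = DataSource.read("documents.arff");

    Dl4jStringToWord2Vec w2v = new Dl4jStringToWord2Vec();
    // -action WORD_VECTOR  -> one instance per word (an embeddings matrix);
    // -action DOC_VECTOR_AVERAGE -> one averaged vector per document.
    // -index 1 points the filter at the String attribute to embed.
    w2v.setOptions(Utils.splitOptions(
        "-action WORD_VECTOR -layerSize 100 -epochs 1 -index 1"));
    w2v.setInputFormat(data);

    Instances embeddings = Filter.useFilter(data, w2v);
    System.out.println(embeddings.numInstances() + " rows, "
        + embeddings.numAttributes() + " attributes");
  }
}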

If you just want to use pre-trained word embeddings, you can install the AffectiveTweets package (https://affectivetweets.cms.waikato.ac.nz/), which provides a filter called TweetToEmbeddingsFeatureVector.
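
Again only a sketch, assuming the package is installed and that the filter lives under weka.filters.unsupervised.attribute; I'm also assuming the default options fall back to the pre-trained embeddings distributed with the package:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.TweetToEmbeddingsFeatureVector;

public class PretrainedEmbeddingsSketch {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("tweets.arff"); // placeholder path

    TweetToEmbeddingsFeatureVector emb = new TweetToEmbeddingsFeatureVector();
    emb.setInputFormat(data); // defaults assumed to load the package's pre-trained embeddings

    Instances out = Filter.useFilter(data, emb);
    System.out.println(out.numAttributes() + " attributes after embedding");
  }
}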

Cheers,
Felipe


Re: Issues related to Word2Vec

valerio jus
Hi Felipe,

Thanks once again.

> If the action option is set to WORD_VECTOR, the filter transforms your textual dataset into a word-embeddings matrix (one word per row, with the embedding values as attributes). Alternatively, you can set action to DOC_VECTOR_AVERAGE, and each document will be represented as the average vector of its words.


I have applied "Dl4jStringToWord2Vec" (action: WORD_VECTOR) from the Classify panel, in conjunction with FilteredClassifier, on Weka's "ReutersCorn-test" data using the settings below, and got the error: Class index is negative (not set)!

The settings used:
weka.classifiers.meta.FilteredClassifier -F "weka.filters.unsupervised.attribute.Dl4jStringToWord2Vec -allowParallelTokenization -batchSize 512 -learningRate 0.025 -minLearningRate 1.0E-4 -negative 0.0 -sampling 0.0 -useHierarchicSoftmax -action WORD_VECTOR -concat_words 15 -embedding_prefix embedding- -epochs 1 -iterations 1 -layerSize 100 -minWordFrequency 5 -preprocessor \"weka.dl4j.text.tokenization.preprocessor.CommonPreProcessor \" -seed 1 -stopWordsHandler \"weka.dl4j.text.stopwords.Dl4jNull \" -index 1 -tokenizerFactory \"weka.dl4j.text.tokenization.tokenizer.factory.DefaultTokenizerFactory \" -windowSize 5 -workers 4" -S 1 -W weka.classifiers.functions.Dl4jMlpClassifier -- -S 1 -cache-mode MEMORY -early-stopping "weka.dl4j.earlystopping.EarlyStopping -maxEpochsNoImprovement 0 -valPercentage 0.0" -normalization "Standardize training data" -iterator "weka.dl4j.iterators.instance.DefaultInstanceIterator -bs 1" -iteration-listener "weka.dl4j.listener.EpochListener -eval true -n 5" -layer "weka.dl4j.layers.OutputLayer -lossFn \"weka.dl4j.lossfunctions.LossMCXENT \" -nOut 0 -activation \"weka.dl4j.activations.ActivationSoftmax \" -name \"Output layer\"" -logConfig "weka.core.LogConfiguration -append true -dl4jLogLevel WARN -logFile C:\\Users\\samersarsam\\wekafiles\\wekaDeeplearning4j.log -nd4jLogLevel INFO -wekaDl4jLogLevel INFO" -config "weka.dl4j.NeuralNetConfiguration -biasInit 0.0 -biasUpdater \"weka.dl4j.updater.Sgd -lr 0.001 -lrSchedule \\\"weka.dl4j.schedules.ConstantSchedule -scheduleType EPOCH\\\"\" -dist \"weka.dl4j.distribution.Disabled \" -dropout \"weka.dl4j.dropout.Disabled \" -gradientNormalization None -gradNormThreshold 1.0 -l1 NaN -l2 NaN -minimize -algorithm STOCHASTIC_GRADIENT_DESCENT -updater \"weka.dl4j.updater.Adam -beta1MeanDecay 0.9 -beta2VarDecay 0.999 -epsilon 1.0E-8 -lr 0.001 -lrSchedule \\\"weka.dl4j.schedules.ConstantSchedule -scheduleType EPOCH\\\"\" -weightInit XAVIER -weightNoise \"weka.dl4j.weightnoise.Disabled \"" -numEpochs 10 -queueSize 0 -zooModel "weka.dl4j.zoo.CustomNet "


How can I solve this problem?

Valerio

Re: Issues related to Word2Vec

Felipe Bravo
Hi,
You can't use the WORD_VECTOR action for document classification, because it transforms your instance space (after running the filter, your instances are words instead of documents). If you want to use Word2Vec inside a FilteredClassifier scheme, you will have to set action to DOC_VECTOR_AVERAGE. Moreover, since the filter doesn't remove the String attribute holding the document content, you should also include the RemoveType filter in a MultiFilter pipeline.

An example is given below:

weka.classifiers.meta.FilteredClassifier -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.Dl4jStringToWord2Vec -allowParallelTokenization -batchSize 512 -learningRate 0.025 -minLearningRate 1.0E-4 -negative 0.0 -sampling 0.0 -useHierarchicSoftmax -action DOC_VECTOR_AVERAGE -concat_words 15 -embedding_prefix embedding- -epochs 4 -iterations 1 -layerSize 100 -minWordFrequency 5 -preprocessor \\\"weka.dl4j.text.tokenization.preprocessor.CommonPreProcessor \\\" -seed 1 -stopWordsHandler \\\"weka.dl4j.text.stopwords.Dl4jNull \\\" -index 1 -tokenizerFactory \\\"weka.dl4j.text.tokenization.tokenizer.factory.DefaultTokenizerFactory \\\" -windowSize 5 -workers 6\" -F \"weka.filters.unsupervised.attribute.RemoveType -T string\"" -S 1 -W weka.classifiers.functions.Dl4jMlpClassifier -- -S 1 -cache-mode MEMORY -early-stopping "weka.dl4j.earlystopping.EarlyStopping -maxEpochsNoImprovement 0 -valPercentage 0.0" -normalization "Standardize training data" -iterator "weka.dl4j.iterators.instance.DefaultInstanceIterator -bs 1" -iteration-listener "weka.dl4j.listener.EpochListener -eval true -n 5" -layer "weka.dl4j.layers.OutputLayer -lossFn \"weka.dl4j.lossfunctions.LossMCXENT \" -nOut 0 -activation \"weka.dl4j.activations.ActivationSoftmax \" -name \"Output layer\"" -logConfig "weka.core.LogConfiguration -append true -dl4jLogLevel WARN -logFile /home/fbravoma/wekafiles/wekaDeeplearning4j.log -nd4jLogLevel INFO -wekaDl4jLogLevel INFO" -config "weka.dl4j.NeuralNetConfiguration -biasInit 0.0 -biasUpdater \"weka.dl4j.updater.Sgd -lr 0.001 -lrSchedule \\\"weka.dl4j.schedules.ConstantSchedule -scheduleType EPOCH\\\"\" -dist \"weka.dl4j.distribution.Disabled \" -dropout \"weka.dl4j.dropout.Disabled \" -gradientNormalization None -gradNormThreshold 1.0 -l1 NaN -l2 NaN -minimize -algorithm STOCHASTIC_GRADIENT_DESCENT -updater \"weka.dl4j.updater.Adam -beta1MeanDecay 0.9 -beta2VarDecay 0.999 -epsilon 1.0E-8 -lr 0.001 -lrSchedule \\\"weka.dl4j.schedules.ConstantSchedule -scheduleType EPOCH\\\"\" -weightInit XAVIER -weightNoise \"weka.dl4j.weightnoise.Disabled \"" -numEpochs 10 -queueSize 0 -zooModel "weka.dl4j.zoo.CustomNet "
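
The same pipeline can also be wired up through the Weka Java API. The sketch below abridges the option strings above, leaves everything else at its defaults, and swaps in J48 for the Dl4jMlpClassifier just to keep it short:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.MultiFilter;
import weka.filters.unsupervised.attribute.Dl4jStringToWord2Vec;
import weka.filters.unsupervised.attribute.RemoveType;

public class Word2VecPipelineSketch {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("ReutersCorn-train.arff"); // placeholder path
    // Setting the class avoids the "Class index is negative (not set)!" error.
    data.setClassIndex(data.numAttributes() - 1);

    // DOC_VECTOR_AVERAGE: one averaged embedding vector per document.
    Dl4jStringToWord2Vec w2v = new Dl4jStringToWord2Vec();
    w2v.setOptions(Utils.splitOptions(
        "-action DOC_VECTOR_AVERAGE -layerSize 100 -epochs 4 -index 1"));

    // Drop the original String attribute so the classifier sees only numeric embeddings.
    RemoveType removeString = new RemoveType();
    removeString.setOptions(Utils.splitOptions("-T string"));

    MultiFilter pipeline = new MultiFilter();
    pipeline.setFilters(new Filter[] { w2v, removeString });

    FilteredClassifier fc = new FilteredClassifier();
    fc.setFilter(pipeline);
    fc.setClassifier(new weka.classifiers.trees.J48()); // any classifier works here

    Evaluation eval = new Evaluation(data);
    eval.crossValidateModel(fc, data, 10, new Random(1));
    System.out.println(eval.toSummaryString());
  }
}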



The results are very poor, though; you need a larger corpus to train decent word embeddings.
Cheers,
Felipe


Re: Issues related to Word2Vec

valerio jus
Thanks a lot, Felipe. The settings you provided worked well.

Kind regards,
Valerio


