Different Results for StringToWordVector between UI and Java

Etaliken
Hello,

I applied the StringToWordVector filter to my dataset with the following
parameters:

weka.filters.unsupervised.attribute.StringToWordVector -R first-last -W
10000 -prune-rate -1.0 -N 0 -stemmer weka.core.stemmers.NullStemmer
-stopwords-handler weka.core.stopwords.Null -M 1 -tokenizer
"weka.core.tokenizers.CharacterNGramTokenizer -max 4 -min 1"

For my dataset I get 54296 attributes in the WEKA UI. I trained my model with
these parameters and now want to use the model in Java to predict new
data.

This is my Java Code:

StringToWordVector filter = new StringToWordVector();
filter.setWordsToKeep(10000);
filter.setLowerCaseTokens(true);
filter.setOutputWordCounts(false);
filter.setTFTransform(false);
filter.setIDFTransform(false);
filter.setAttributeIndices("first-last");
CharacterNGramTokenizer tokenizer = new CharacterNGramTokenizer();
tokenizer.setNGramMinSize(1);
tokenizer.setNGramMaxSize(4);
filter.setTokenizer(tokenizer);
filter.setInputFormat(trainedData);
Instances output = Filter.useFilter(data,filter);

However, my "output" has only 22529 attributes, which results in an error
when I use it with my pretrained model.

What am I missing?

Kind regards




--
Sent from: http://weka.8497.n7.nabble.com/
_______________________________________________
Wekalist mailing list
Send posts to: Wekalist@list.waikato.ac.nz
To subscribe, unsubscribe, etc., visit https://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html
Re: Different Results for StringToWordVector between UI and Java

Eibe Frank-2
Administrator
You don’t have -L (convert tokens to lower case) in the configuration you used in the UI, whereas your Java code calls setLowerCaseTokens(true).

This might be the reason, assuming you use exactly the same dataset in both cases.
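To illustrate why the lower-case setting alone can change the number of attributes, here is a minimal sketch in plain Java. It does not use WEKA: the class NGramCaseDemo and the helper distinctNGrams are hypothetical names made up for this example, and the three sample strings are invented. It only counts distinct character n-grams of length 1 to 4, with and without lower-casing; the real filter also applies -W, -M and other options, so actual counts will differ.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class NGramCaseDemo {

    // Collects the distinct character n-grams of length min..max over all texts.
    // Hypothetical helper for illustration only; this is not WEKA's tokenizer.
    static Set<String> distinctNGrams(List<String> texts, int min, int max, boolean lowerCase) {
        Set<String> grams = new HashSet<>();
        for (String text : texts) {
            // Lower-casing merges n-grams that differ only in case.
            String s = lowerCase ? text.toLowerCase(Locale.ROOT) : text;
            for (int n = min; n <= max; n++) {
                for (int i = 0; i + n <= s.length(); i++) {
                    grams.add(s.substring(i, i + n));
                }
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Invented sample data: the same word in three different casings.
        List<String> texts = Arrays.asList("Weka", "WEKA", "weka");
        int caseSensitive = distinctNGrams(texts, 1, 4, false).size();
        int lowerCased = distinctNGrams(texts, 1, 4, true).size();
        System.out.println("case-sensitive n-grams: " + caseSensitive); // 23
        System.out.println("lower-cased n-grams:    " + lowerCased);    // 10
    }
}
```

With case preserved, "Weka", "WEKA" and "weka" contribute mostly different n-grams; after lower-casing they collapse into the ten n-grams of "weka". The same effect would make a dictionary built with lower-casing smaller than one built without it.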

Not quite sure what you mean by “results in an error using my pretrained model”?

Cheers,
Eibe

> On 29/01/2019, at 8:50 PM, Etaliken <[hidden email]> wrote:
>
> Hello,
>
> I used the "StringToWordVector"-Filter on my Dataset with the following
> Parameters:
>
> weka.filters.unsupervised.attribute.StringToWordVector -R first-last -W
> 10000 -prune-rate -1.0 -N 0 -stemmer weka.core.stemmers.NullStemmer
> -stopwords-handler weka.core.stopwords.Null -M 1 -tokenizer
> "weka.core.tokenizers.CharacterNGramTokenizer -max 4 -min 1"
>
> For my Dataset i get 54296 Attributes in WEKA UI. I trained my model with
> these parameters and now wanted to use my model in JAVA and predict new
> data.
>
> This is my Java Code:
>
> StringToWordVector filter = new StringToWordVector();
> filter.setWordsToKeep(10000);
> filter.setLowerCaseTokens(true);
> filter.setOutputWordCounts(false);
> filter.setTFTransform(false);
> filter.setIDFTransform(false);
> filter.setAttributeIndices("first-last");
> NGramTokenizer tokenizer = new NGramTokenizer();
> tokenizer.setNGramMinSize(1);
> tokenizer.setNGramMaxSize(4);
> filter.setTokenizer(tokenizer);
> filter.setInputFormat(trainedData);
> Instances output = Filter.useFilter(data,filter);
>
> However my "output" has only 22529 Attributes which results in an error
> using my pretrained Model.
>
> What do i missing?
>
> Kind regards