How to solve this Weka Classifier/StringtoWordVector problem

paul.bauriegel
Hi,
I have created a simple Classifier for the Reuters Corn Data using the Weka
Explorer.
I also did some pre-processing of the data:
weka.filters.unsupervised.attribute.StringToWordVector -R first-last -W 1000
-prune-rate -1.0 -T -I -N 0 -L -stemmer weka.core.stemmers.SnowballStemmer
-stopwords-handler weka.core.stopwords.Null -M 1 -tokenizer
"weka.core.tokenizers.WordTokenizer -delimiters \"
\\r\\n\\t.,;:\\\'\\\"()?!\""

Now I want to use the Weka model to predict the class of new, unseen text
snippets. I built the code from the wiki examples:
Creating an Instance: https://waikato.github.io/weka-wiki/creating_arff_file/
Applying StringToWordVector:
https://stackoverflow.com/questions/41821762/load-naive-bays-model-in-java-code-using-weka-jar/41832576#41832576
Deserializing the model:
https://waikato.github.io/weka-wiki/serialization/#deserializing
Classifying the Instance:
https://waikato.github.io/weka-wiki/use_weka_in_your_java_code/#clustering-instances

However, the code always predicts the same class no matter what the input
looks like. I assume this is because the StringToWordVector filter always
returns {0 1} for the sample snippet, but I cannot figure out why:

StringToWordVector stv = new StringToWordVector();
stv.setInputFormat(trainingData);
stv.setIDFTransform(true);
stv.setTFTransform(true);
stv.setLowerCaseTokens(true);
SnowballStemmer stemmer = new SnowballStemmer();
stv.setStemmer(stemmer);
stv.setDoNotOperateOnPerClassBasis(true);
Instances filtered = Filter.useFilter(newData, stv);
filtered.setClassIndex(0);

Classifier
(https://github.com/paulbauriegel/text-classification-weka/blob/master/data/corn.model?raw=true)
Complete Code
(https://raw.githubusercontent.com/paulbauriegel/text-classification-weka/master/src/de/qaass/classifier/TextClassTest.java)

Could someone take a quick look at my code and figure out what I am doing
wrong? Thanks in advance.

_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
To subscribe, unsubscribe, etc., visit https://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: How to solve this Weka Classifier/StringtoWordVector problem

Mark Hall


On 19/12/18, 8:21 AM, "[hidden email] on behalf of [hidden email]" <[hidden email] on behalf of [hidden email]> wrote:

    [...]

The problem is that you are creating separate dictionaries for the training data and the test data by running the StringToWordVector filter independently on each. The dictionary has to be generated as part of the training process and then applied to the test data. You can accomplish this with a FilteredClassifier, which applies the "trained" StringToWordVector filter to the test data so that test instances are vectorized into the same space as the training data.

Cheers,
Mark.
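
To make the point about dictionaries concrete, here is a minimal plain-Java sketch. It is not Weka code: ToyDictionary, fit, and vectorize are hypothetical names made up for illustration, and the tokenizer is a crude assumption. It only shows why a dictionary built from the training text and one built from the new snippet put the same words in different columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy stand-in for a word-vector filter; hypothetical, NOT Weka's StringToWordVector.
class ToyDictionary {
    // word -> column index, fixed once the dictionary has been built
    private final Map<String, Integer> index = new LinkedHashMap<>();

    // Build the dictionary from a set of documents (the "training" step).
    static ToyDictionary fit(List<String> docs) {
        ToyDictionary d = new ToyDictionary();
        for (String doc : docs) {
            for (String tok : tokenize(doc)) {
                d.index.putIfAbsent(tok, d.index.size());
            }
        }
        return d;
    }

    // Lower-case and split on anything that is not a-z (assumed tokenizer).
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Count words in the fixed space; words outside the dictionary are dropped.
    int[] vectorize(String text) {
        int[] v = new int[index.size()];
        for (String tok : tokenize(text)) {
            Integer col = index.get(tok);
            if (col != null) {
                v[col]++;
            }
        }
        return v;
    }

    int size() {
        return index.size();
    }

    public static void main(String[] args) {
        List<String> train = Arrays.asList("Corn prices rise", "Wheat exports fall");
        String unseen = "corn exports rise sharply";

        // Dictionary from training data: 6 columns, the snippet lands in the same space.
        ToyDictionary shared = ToyDictionary.fit(train);
        System.out.println(Arrays.toString(shared.vectorize(unseen)));   // [1, 0, 1, 0, 1, 0]

        // Dictionary from the snippet alone: 4 columns that mean something else.
        ToyDictionary separate = ToyDictionary.fit(Arrays.asList(unseen));
        System.out.println(Arrays.toString(separate.vectorize(unseen))); // [1, 1, 1, 1]
    }
}
```

A classifier trained on the six-column space cannot interpret the four-column vector. In Weka itself, the usual way to keep the two in sync is weka.classifiers.meta.FilteredClassifier: configure it with setFilter(...) and setClassifier(...), call buildClassifier(...) on the raw string data, and serialize the FilteredClassifier as a whole.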
   

Re: How to solve this Weka Classifier/StringtoWordVector problem

paul.bauriegel
Thanks a lot, Mark! I tried your approach and everything works as expected.
However, there is one thing about the Weka filters that still confuses me. What I tried to implement before was based on the Batch Filtering feature (https://waikato.github.io/weka-wiki/use_weka_in_your_java_code/#batch-filtering). As I understood it, calling setInputFormat(trainingData) should make it possible to process a second file with the same dictionary as the one used for trainingData.


-----Original Message-----
From: [hidden email] <[hidden email]> On Behalf Of Mark Hall
Sent: Tuesday, 18 December 2018 20:38
To: Weka machine learning workbench list. <[hidden email]>
Subject: Re: [Wekalist] How to solve this Weka Classifier/StringtoWordVector problem



[...]

Re: How to solve this Weka Classifier/StringtoWordVector problem

Eibe Frank-3
The bit of documentation you read might have been misleading. setInputFormat() resets the filter and configures it to process data based on the format defined by the Instances object provided as the argument.

Cheers,
Eibe
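
The call-order issue discussed above can be sketched with a toy model in plain Java. This is not Weka source code: ToyBatchFilter and its methods are hypothetical, and the model rests on the assumption (taken from the wiki's batch-filtering pattern) that the dictionary is fixed by the first batch pushed through the filter after setInputFormat(), not by setInputFormat() itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Toy model of a stateful batch filter; hypothetical, not Weka source code.
class ToyBatchFilter {
    private List<String> dictionary = null; // null until the first batch is processed

    // Modelled on setInputFormat(): resets state; the data's contents are ignored.
    void setInputFormat(List<String> data) {
        dictionary = null;
    }

    // Modelled on pushing one batch through the filter.
    List<int[]> process(List<String> batch) {
        if (dictionary == null) {
            // First batch after a reset: the dictionary comes from THIS batch.
            dictionary = new ArrayList<>();
            for (String doc : batch) {
                for (String tok : tokens(doc)) {
                    if (!dictionary.contains(tok)) {
                        dictionary.add(tok);
                    }
                }
            }
        }
        List<int[]> out = new ArrayList<>();
        for (String doc : batch) {
            int[] v = new int[dictionary.size()];
            for (String tok : tokens(doc)) {
                int col = dictionary.indexOf(tok);
                if (col >= 0) {
                    v[col] = 1; // word presence only
                }
            }
            out.add(v);
        }
        return out;
    }

    List<String> dictionary() {
        return dictionary;
    }

    private static String[] tokens(String text) {
        return text.toLowerCase(Locale.ROOT).trim().split("\\s+");
    }

    public static void main(String[] args) {
        List<String> train = Arrays.asList("corn prices rise", "wheat exports fall");
        List<String> unseen = Arrays.asList("corn exports");

        // Order from the question: the new data is the first batch.
        ToyBatchFilter wrong = new ToyBatchFilter();
        wrong.setInputFormat(train);
        wrong.process(unseen);
        System.out.println(wrong.dictionary()); // [corn, exports]

        // Batch order: training data first, then the new data on the same object.
        ToyBatchFilter right = new ToyBatchFilter();
        right.setInputFormat(train);
        right.process(train);
        System.out.println(Arrays.toString(right.process(unseen).get(0))); // [1, 0, 0, 0, 1, 0]
    }
}
```

In Weka terms, this corresponds to the wiki's batch-filtering pattern: setInputFormat(train), then Filter.useFilter(train, filter), and only then Filter.useFilter(newData, filter) on the same filter object.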

On Thu, 20 Dec 2018 at 7:41 PM, <[hidden email]> wrote:
[...]