StringToVector filter, minimum term frequency

StringToVector filter, minimum term frequency

Marina Santini
Hi, 

I have a question about the StringToWordVector filter, which I am using inside a FilteredClassifier.

I need the list of words that were actually used in the classification task.
For this reason, I set the following parameters: 

image.png

Since I set MinTermFreq to 5, I expected the dictionary file (which I named eCare_WordList_5-10000) to contain only words with a frequency of 5 or more. Instead, this file contains words with a frequency lower than 5, e.g.:

image.png

How is that? 

Thanks in advance for your answer. 

Best wishes
Marina

_______________________________________________
Wekalist mailing list -- [hidden email]
Send posts to: To unsubscribe send an email to [hidden email]
To subscribe, unsubscribe, etc., visit
https://list.waikato.ac.nz/postorius/lists/wekalist.list.waikato.ac.nz
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: StringToVector filter, minimum term frequency

Eibe Frank-2
Administrator
Looking at the code that generates this output, the number in the second column appears to be the *document* frequency, not the term frequency.

If you want to see which words are used by the classifier, take a look at the textual description corresponding to the trained FilteredClassifier object. It should have a list of all the words that are used, as corresponding @attribute entries. Alternatively, use a classifier such as NaiveBayesMultinomial, which does not discard any attributes, and take a look at the textual description of the trained model.
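
The difference between the two frequencies can be illustrated with a toy sketch in plain Python. This is not Weka's code: the corpus, the function name, and the whitespace tokenisation are all assumptions made for illustration. It shows how a word can occur 5 times in total (term frequency) while appearing in only 2 documents (document frequency), which would explain dictionary entries whose second column is below the minimum term frequency.

```python
from collections import Counter


def term_and_document_frequency(documents):
    """Return (term_freq, doc_freq) Counters for whitespace-tokenised documents."""
    term_freq = Counter()
    doc_freq = Counter()
    for text in documents:
        tokens = text.lower().split()
        term_freq.update(tokens)       # every occurrence counts
        doc_freq.update(set(tokens))   # at most one count per document
    return term_freq, doc_freq


# Hypothetical three-document corpus, invented for this example.
docs = [
    "fever fever fever cough",
    "fever fever headache",
    "cough headache",
]
term_freq, doc_freq = term_and_document_frequency(docs)

# "fever" occurs 5 times overall but in only 2 documents.
print(term_freq["fever"], doc_freq["fever"])
```

With a minimum term frequency of 5, "fever" would be kept in this sketch even though its document frequency is 2.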

Cheers,
Eibe

In reply to Marina Santini's message of Sun, Sep 22, 2019 at 10:51 PM.

Re: StringToVector filter, minimum term frequency

Marina Santini
Thanks, Eibe, for your quick reply. 

I have checked the @attribute list in the output, and it contains 5423 attributes, although I specified 4500 in the WordsToKeep field. I would expect the classifier to use only 4500 words. Why are these numbers different?

I also wonder why the number of attribute "weights" is 5193, i.e. neither 5423 nor 4500. Could you please explain why that is?

Also, I set the option called "outputWordCount" to True. I thought that the word counts were written to the file specified in the field "dictionaryFileSaveTo". But in your previous reply you say that the number in the second column is the document frequency and not the word frequency. Where can I check the word frequency, then? Where does "outputWordCount" store its output?

I attach my output for reference. 

Thanks in advance.

Cheers, Marina



In reply to Eibe Frank's message of Sun, 22 Sep 2019 at 13:22.

Attachment: eCare_SMO_output_5-4500.txt (483K)

Re: StringToVector filter, minimum term frequency

Eibe Frank-2
Administrator
The StringToWordVector filter does not break ties. If the token at position 4,500 has term frequency F in the data, all other tokens with that frequency F will also be kept. Thus, in your case, there are at least 923 other tokens that have the same frequency as the token at position 4,500.
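
A minimal sketch of this tie-keeping behaviour, written in plain Python rather than taken from Weka's source: the function name `keep_top_words`, the parameter `words_to_keep`, and the toy counts are hypothetical. The idea is to read off the frequency of the word at the cut-off position and then keep every word whose frequency is at least that value, so the result can be larger than the requested size.

```python
from collections import Counter


def keep_top_words(counts, words_to_keep):
    """Keep the words_to_keep most frequent words, plus every word tied with the last one kept."""
    if words_to_keep <= 0:
        return set()
    if len(counts) <= words_to_keep:
        return set(counts)
    ranked = sorted(counts.values(), reverse=True)
    threshold = ranked[words_to_keep - 1]  # frequency of the word at the cut-off position
    # Ties are not broken: everything at or above the threshold survives.
    return {word for word, freq in counts.items() if freq >= threshold}


# Hypothetical frequencies, invented for this example.
counts = Counter({"fever": 9, "cough": 7, "pain": 4, "rash": 4, "itch": 4, "sneeze": 1})
kept = keep_top_words(counts, 3)

# Three words were requested, but "pain", "rash" and "itch" are tied at 4,
# so five words are kept.
print(len(kept), sorted(kept))
```

The same effect, at a larger scale, would turn a request for 4,500 words into 5,423 attributes when many tokens share the frequency found at position 4,500.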

SMO learns a support vector machine, which can perform implicit attribute selection. This is one of its primary benefits.

The “outputWordCount” option determines whether the attribute values of the instances correspond to counts or just binary indicators. You can see the effect of this setting by opening the data in the dataset editor.
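
As a rough illustration of counts versus binary indicators (plain Python, not Weka's implementation; `vectorize`, its `output_word_counts` flag, and the vocabulary are assumptions made up for this sketch), the same document yields two different attribute rows depending on the setting:

```python
def vectorize(text, vocabulary, output_word_counts):
    """Turn a document into one attribute value per vocabulary word."""
    tokens = text.lower().split()
    if output_word_counts:
        # Attribute value = number of occurrences of the word in this document.
        return [tokens.count(word) for word in vocabulary]
    # Attribute value = 1 if the word is present in this document, else 0.
    return [1 if word in tokens else 0 for word in vocabulary]


# Hypothetical vocabulary and document, invented for this example.
vocabulary = ["cough", "fever", "headache"]
doc = "fever fever cough"

counts_row = vectorize(doc, vocabulary, True)
binary_row = vectorize(doc, vocabulary, False)
print(counts_row, binary_row)
```

In both cases the values live in the instances themselves, not in a separate file, which is why the setting is visible in the dataset editor rather than in the dictionary file.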

Cheers,
Eibe

In reply to Marina Santini's message of 23/09/2019 at 1:06 AM.

Re: StringToVector filter, minimum term frequency

Marina Santini
Thanks for the answer, Eibe!

Have a great day

Marina

In reply to Eibe Frank's message of Wed, 25 Sep 2019 at 05:17.