StringToWordVector with new documents and TF-IDF help needed


trubinsh
Hi everyone!
I am building a focused web crawler that uses Weka for text classification. I
have already built training and test data sets, with 720 attributes, using
batch filtering with StringToWordVector. Now I want to classify new documents.
I was able to do this earlier by manually tokenizing and counting attribute
occurrences, but I have since switched to TF-IDF attribute values for better
results, and now I can't get new instances filtered. I tried following this
<https://weka.programmingpedia.net/en/tutorial/7753/text-classification>
tutorial, but it didn't help: the filter rejected the differing number of
attributes. I also tried FixedDictionaryStringToWordVector, but couldn't get
it to work. If someone could point me in the right direction or post a link
to a helpful tutorial, I would really appreciate it.

Thanks in advance!
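
To make the "different attribute amounts" problem concrete, here is a minimal plain-Java sketch (no Weka involved) of counting terms against a dictionary that was fixed at training time. The class name, method name, and the three-word dictionary are hypothetical and chosen only for illustration; the point is that the output vector always has one entry per dictionary term, so a new document yields the same number of attributes as the training data.

```java
import java.util.Arrays;
import java.util.List;

public class FixedVocabularyCounts {

    // Count how often each dictionary term occurs in one document.
    // The result always has dictionary.size() entries, whatever the document contains.
    public static double[] countTerms(List<String> dictionary, String document) {
        double[] counts = new double[dictionary.size()];
        for (String token : document.toLowerCase().split("\\W+")) {
            int index = dictionary.indexOf(token);
            if (index >= 0) {
                counts[index]++; // words outside the dictionary are ignored
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical dictionary, standing in for the one built from the training set.
        List<String> dictionary = Arrays.asList("crawler", "weka", "filter");
        double[] vector = countTerms(dictionary, "Weka filter: a new Weka document");
        System.out.println(Arrays.toString(vector)); // prints [0.0, 2.0, 1.0]
    }
}
```

Running a filter over new documents without the training dictionary builds a fresh vocabulary instead, which is one likely reason for a mismatch in attribute counts.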



--
Sent from: https://weka.8497.n7.nabble.com/
_______________________________________________
Wekalist mailing list -- [hidden email]
Send posts to [hidden email]
To unsubscribe send an email to [hidden email]
To subscribe, unsubscribe, etc., visit https://list.waikato.ac.nz/postorius/lists/wekalist.list.waikato.ac.nz
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: StringToWordVector with new documents and TF-IDF help needed

Manuel Leuenberger
Hi,

I noticed as well that FixedDictionaryStringToWordVector does not produce correct results when using IDF; see the attached message. As a workaround, I used reflection to make it behave, as follows:

    // Imports needed by this fragment:
    //   java.lang.reflect.Field
    //   weka.core.DictionaryBuilder
    //   weka.core.stemmers.IteratedLovinsStemmer
    //   weka.filters.Filter
    //   weka.filters.unsupervised.attribute.FixedDictionaryStringToWordVector

    // setAttributeIndices expects 1-based indices
    int i = instances.attribute("text").index() + 1;
    FixedDictionaryStringToWordVector filter = new FixedDictionaryStringToWordVector();
    filter.setLowerCaseTokens(true);
    filter.setStemmer(new IteratedLovinsStemmer());
    filter.setOutputWordCounts(true);
    filter.setTFTransform(true);
    filter.setIDFTransform(true);
    filter.setAttributeNamePrefix("tfidf-");
    filter.setAttributeIndices(String.format("%d-%d", i, i));
    filter.setDictionaryFile(this.dictionary);
    filter.setInputFormat(instances);
    // fix broken m_count in dictionary build, any positive constant will work
    Field mCount = DictionaryBuilder.class.getDeclaredField("m_count");
    mCount.setAccessible(true);
    Field mVectorizer = FixedDictionaryStringToWordVector.class.getDeclaredField("m_vectorizer");
    mVectorizer.setAccessible(true);
    // overwrite the vectorizer's document count, which is 0 after loading the dictionary
    mCount.set(mVectorizer.get(filter), 1000);
    return Filter.useFilter(instances, filter);

Cheers,
Manuel


Hi all,

I am writing here to report an issue with the FixedDictionaryStringToWordVector filter when using the IDF transform.

I am using WEKA 3.8.4. I am building a pipeline to classify text and wanted to use TF-IDF features. As I want to use different datasets, I am building a word dictionary using the StringToWordVector filter on a training set, then save that dictionary for later use with the FixedDictionaryStringToWordVector filter on various test sets. So far so good, but I noticed that using the IDF transform of FixedDictionaryStringToWordVector produces -Infinity values for all words. I dug into the source and found that the DictionaryBuilder vectorizer relies on its m_count property in the IDF transform in vectorizeInstance(), which is always 0 for dictionaries loaded by FixedDictionaryStringToWordVector. I can work around this by setting the field to a constant after setting the input format through reflection, but this is a really ugly hack. I think the serialization of the Dictionary should include the m_count and load it accordingly in the builder.
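
The effect of a zero document count can be reproduced in plain Java. The sketch below assumes the transforms take the form given in the Weka documentation for StringToWordVector (TF as log(1 + f), IDF as f * log(number of documents / number of documents containing the word)); the class and method names are hypothetical, and this is not the DictionaryBuilder code itself.

```java
public class IdfWithZeroDocCount {

    // TF-IDF weight under the assumed formulas:
    // TF: log(1 + termCount), IDF: tf * log(numDocs / docsWithTerm).
    public static double tfIdf(double termCount, double numDocs, double docsWithTerm) {
        double tf = Math.log(1.0 + termCount);
        return tf * Math.log(numDocs / docsWithTerm);
    }

    public static void main(String[] args) {
        // numDocs = 0 mimics a dictionary loaded without its document count:
        // log(0 / 25) = log(0) = -Infinity.
        System.out.println(tfIdf(3, 0, 25)); // prints -Infinity

        // A positive count gives a finite weight, but under this formula only the
        // real training document count reproduces the weights seen during training.
        System.out.println(tfIdf(3, 1000, 25));
    }
}
```

This is why overwriting the count with a constant removes the -Infinity values: the logarithm's argument becomes positive for every word in the dictionary.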

I think this is a bug, as the IDF transform in FixedDictionaryStringToWordVector is producing useless/wrong/constant results without my ugly workaround. Should I create an issue for this in JIRA, or is the mailing list the appropriate entry point for this?

Cheers,
Manuel




On 7 May 2020, at 23:55, trubinsh <[hidden email]> wrote:


Re: StringToWordVector with new documents and TF-IDF help needed

trubinsh
Thank you so much, Manuel, this is exactly what I needed.




Re: StringToWordVector with new documents and TF-IDF help needed

Mark Hall
In reply to this post by Manuel Leuenberger

Thanks for the bug report! This has now been fixed in all Weka branches by ensuring that the number of documents used to construct the dictionary is written into the dictionary file. You will need to regenerate your dictionary in order to get this information (or, for textual dictionaries, add a line to the top like so “@@@numDocs=xx@@@”, where xx = number of documents). The fix can be obtained in the next nightly snapshot of Weka.
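
For an existing textual dictionary, the header line described above can be added by hand or with a few lines of plain Java. The following is only a sketch: the class name, method name, and command-line arguments are hypothetical, the file is assumed to be UTF-8 text, and nothing is assumed about the dictionary's other lines, which are copied unchanged.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class AddNumDocsHeader {

    // Insert "@@@numDocs=<n>@@@" as the first line unless such a header is already there.
    public static void addHeader(Path dictionary, int numDocs) throws IOException {
        List<String> lines =
                new ArrayList<>(Files.readAllLines(dictionary, StandardCharsets.UTF_8));
        if (!lines.isEmpty() && lines.get(0).startsWith("@@@numDocs=")) {
            return; // already patched, leave the file untouched
        }
        lines.add(0, "@@@numDocs=" + numDocs + "@@@");
        Files.write(dictionary, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java AddNumDocsHeader <dictionary file> <number of documents>
        addHeader(Paths.get(args[0]), Integer.parseInt(args[1]));
    }
}
```

The number passed in should be the number of documents the dictionary was built from, not the number of attributes.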

 

Cheers,
Mark.

 

On 11/05/20, 1:11 AM, "Manuel Leuenberger" <[hidden email]> wrote:

 


Re: StringToWordVector with new documents and TF-IDF help needed

Manuel Leuenberger
Hi Mark,

Perfect, thanks a lot!

Cheers,
Manuel

On 11 May 2020, at 01:38, Mark Hall <[hidden email]> wrote:
