FixedDictionaryStringToWordVector IDF transform broken


Manuel Leuenberger
Hi all,

I am writing here to report an issue with the FixedDictionaryStringToWordVector filter when using the IDF transform.

I am using WEKA 3.8.4. I am building a pipeline to classify text and want to use TF-IDF features. Because I work with several datasets, I build a word dictionary by running the StringToWordVector filter on a training set, then save that dictionary for later use with the FixedDictionaryStringToWordVector filter on various test sets.

So far so good, but I noticed that the IDF transform of FixedDictionaryStringToWordVector produces -Infinity values for all words. I dug into the source and found that the DictionaryBuilder vectorizer relies on its m_count field for the IDF transform in vectorizeInstance(), and that field is always 0 for dictionaries loaded by FixedDictionaryStringToWordVector. I can work around this by using reflection to set the field to a constant after setting the input format, but this is a really ugly hack. I think the serialized dictionary should include m_count, and the builder should restore it when loading.
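To illustrate the failure mode, here is a minimal sketch that does not use WEKA at all. SketchVectorizer is a hypothetical stand-in, not the real DictionaryBuilder, and it assumes the IDF transform has the common form tf * log(N / df), where N is the document count held in m_count. With N left at 0 the logarithm is taken of zero, which gives -Infinity for every word with a non-zero term frequency. The forceCount helper is likewise hypothetical and only mimics the reflection workaround described above.

```java
import java.lang.reflect.Field;

final class IdfSketch {

    /** Hypothetical stand-in for the vectorizer; NOT the WEKA class. */
    static final class SketchVectorizer {
        // Number of documents seen while building the dictionary.
        // Assumption mirroring the report: it stays 0 when only a
        // saved dictionary is loaded.
        private int m_count = 0;

        /** Assumed IDF form: tf * log(N / df), with N = m_count. */
        double idf(double tf, int docFreq) {
            return tf * Math.log(m_count / (double) docFreq);
        }
    }

    /** Overwrites the private counter via reflection (the "ugly hack"). */
    static void forceCount(SketchVectorizer v, int count) {
        try {
            Field f = SketchVectorizer.class.getDeclaredField("m_count");
            f.setAccessible(true);
            f.setInt(v, count);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("could not set m_count", e);
        }
    }

    public static void main(String[] args) {
        SketchVectorizer v = new SketchVectorizer();

        // m_count is 0, so log(0 / 3.0) = log(0.0) = -Infinity.
        System.out.println(v.idf(2.0, 3)); // prints -Infinity

        // After forcing a document count, the value becomes finite.
        forceCount(v, 100);
        System.out.println(v.idf(2.0, 3));
    }
}
```

Note that any constant forced in this way only makes the values finite. The weights are correct only if the constant equals the number of documents the dictionary was actually built from, which is why storing the count alongside the dictionary seems like the cleaner fix.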

I think this is a bug, as the IDF transform in FixedDictionaryStringToWordVector is producing useless/wrong/constant results without my ugly workaround. Should I create an issue for this in JIRA, or is the mailing list the appropriate entry point for this?

Wekalist mailing list -- [hidden email]