What's the difference between "bag-of-words" and "term-frequency" in the NaiveBayesMultinomialText weka algorithm?

What's the difference between "bag-of-words" and "term-frequency" in the NaiveBayesMultinomialText weka algorithm?

BrendaAlexsandra
To classify my text, I can use these two techniques, but I don't know the
difference between them (for me it's the same thing) and I can't find in the
documentation what these attributes represent.



--
Sent from: https://weka.8497.n7.nabble.com/
_______________________________________________
Wekalist mailing list -- [hidden email]
Send posts to: To unsubscribe send an email to [hidden email]
To subscribe, unsubscribe, etc., visit
https://list.waikato.ac.nz/postorius/lists/wekalist.list.waikato.ac.nz
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: What's the difference between "bag-of-words" and "term-frequency" in the NaiveBayesMultinomialText weka algorithm?

Eibe Frank-2
Administrator
Using a bag-of-words representation just means that you treat a document as an unordered collection of words, i.e., the order of the words in the document is considered immaterial (which is clearly a simplification). Mathematically, a bag is like a set, but it can contain multiple copies of the same element. This is appropriate because a document can contain a word more than once. In the simplest case, term frequency refers to the number of times a particular term (i.e., word) occurs in a bag.
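To make the bag idea concrete, here is a minimal sketch in plain Java (not Weka code). The class name `BagOfWordsSketch`, the method `termFrequencies`, and the naive tokenizer that splits on anything other than the letters a-z are all assumptions made for illustration. It shows that two documents with the same words in a different order produce the same bag, and that a repeated word is counted more than once.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; this is not part of Weka.
class BagOfWordsSketch {

    // Builds a bag (multiset) of words: each term maps to the number of times it occurs.
    // Tokenization is deliberately naive: lowercase, then split on non-letter characters.
    static Map<String, Integer> termFrequencies(String document) {
        Map<String, Integer> bag = new TreeMap<>();
        for (String token : document.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            bag.merge(token, 1, Integer::sum); // term frequency = occurrence count
        }
        return bag;
    }

    public static void main(String[] args) {
        // Word order differs, but the resulting bags are identical.
        System.out.println(termFrequencies("the cat saw the dog")); // {cat=1, dog=1, saw=1, the=2}
        System.out.println(termFrequencies("the dog saw the cat")); // {cat=1, dog=1, saw=1, the=2}
    }
}
```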

By default, StringToWordVector, NaiveBayesMultinomialText, and SGDText treat a document as a set, i.e., the attributes that are extracted from a document are just binary indicators that show whether a word occurs or does not occur in the document. The frequency of each term/word is ignored. However, if you set "useWordFrequencies" to the value "True" (in StringToWordVector, the corresponding option is called "outputWordCounts"), the term frequencies will be used and consequently the attribute values will be non-negative integers, i.e., the document will be treated as a bag.
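The difference between the two settings can be sketched in plain Java as follows. This is not how Weka implements it internally; `WordVectorSketch`, `toVector`, the fixed vocabulary, and the simple tokenizer are hypothetical, and the boolean parameter merely borrows the name of the option mentioned above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of set-style vs. bag-style attribute values; not Weka code.
class WordVectorSketch {

    // Maps a document onto a fixed vocabulary.
    // useWordFrequencies == false: binary indicators (document treated as a set).
    // useWordFrequencies == true:  occurrence counts (document treated as a bag).
    static int[] toVector(String document, List<String> vocabulary, boolean useWordFrequencies) {
        int[] vector = new int[vocabulary.size()];
        for (String token : document.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            int index = vocabulary.indexOf(token);
            if (index < 0) {
                continue; // token is not in the vocabulary (or is empty)
            }
            vector[index] = useWordFrequencies ? vector[index] + 1 : 1;
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> vocabulary = Arrays.asList("cat", "dog", "the");
        String document = "the cat saw the other cat";

        System.out.println(Arrays.toString(toVector(document, vocabulary, false))); // [1, 0, 1]
        System.out.println(Arrays.toString(toVector(document, vocabulary, true)));  // [2, 0, 2]
    }
}
```

With the flag off, "cat" and "the" each contribute a 1 no matter how often they appear; with it on, both contribute their count of 2.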

Cheers,
Eibe


On Tue, Nov 12, 2019 at 8:26 AM BrendaAlexsandra <[hidden email]> wrote:
To classify my text, I can use these two techniques, but I don't know the
difference between them (for me it's the same thing) and I can't find in the
documentation what these attributes represent.


