Re: What's the difference between "bag-of-words" and "term-frequency" in the NaiveBayesMultinomialText weka algorithm?
Using a bag-of-words representation just means that you treat a document as an unordered collection of words, i.e., the order of the words in the document is considered immaterial (which is clearly a simplification). Mathematically, a bag is like a set, but it can contain multiple copies of the same element. This is appropriate because a document can contain a word more than once. In the simplest case, term frequency refers to the number of times a particular term (i.e., word) occurs in a bag.
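To make the bag idea concrete, here is a minimal sketch in plain Java (not Weka code). The class name BagOfWordsDemo, the helper toBag, and the naive tokenization (lowercase, split on non-letters) are assumptions made for illustration only. It shows that two documents with the same words in a different order produce the same bag, and that the bag records how often each word occurs.

```java
import java.util.Map;
import java.util.TreeMap;

public class BagOfWordsDemo {

    // Builds a bag (multiset) of words: each distinct word maps to its term frequency.
    // The tokenization is a naive assumption for illustration: lowercase, split on non-letters.
    static Map<String, Integer> toBag(String document) {
        Map<String, Integer> bag = new TreeMap<>();
        for (String token : document.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) {
                bag.merge(token, 1, Integer::sum); // increment the count for this word
            }
        }
        return bag;
    }

    public static void main(String[] args) {
        Map<String, Integer> a = toBag("the dog chased the cat");
        Map<String, Integer> b = toBag("the cat chased the dog");
        System.out.println(a);           // {cat=1, chased=1, dog=1, the=2}
        System.out.println(a.equals(b)); // true: word order does not matter
    }
}
```

The two sentences mean different things, but as bags they are identical, which is the simplification mentioned above.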
By default, StringToWordVector, NaiveBayesMultinomialText, and SGDText treat a document as a set, i.e., the attributes that are extracted from a document are just binary indicators that show whether a word occurs or does not occur in the document. The frequency of each term/word is ignored. However, if you set "useWordFrequencies" to the value "True" (in StringToWordVector, the corresponding option is called "outputWordCounts"), the term frequencies will be used and consequently the attribute values will be non-negative integers, i.e., the document will be treated as a bag.
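The difference between the two settings can be sketched as follows, again in plain Java rather than with the Weka classes themselves. The class WordVectorDemo, the method toVector, and the fixed three-word vocabulary are hypothetical; the boolean parameter is only named after the Weka option and does not call into Weka.

```java
import java.util.Arrays;
import java.util.List;

public class WordVectorDemo {

    // Maps a document onto a fixed vocabulary, one attribute per vocabulary word.
    // useWordFrequencies is a local flag named after the Weka option (an illustration, not Weka's code):
    //   false -> binary indicators (document treated as a set)
    //   true  -> occurrence counts (document treated as a bag)
    static int[] toVector(List<String> vocabulary, String document, boolean useWordFrequencies) {
        int[] vector = new int[vocabulary.size()];
        for (String token : document.toLowerCase().split("\\s+")) {
            int index = vocabulary.indexOf(token);
            if (index < 0) {
                continue; // word is not in the vocabulary, so it has no attribute
            }
            if (useWordFrequencies) {
                vector[index]++;   // count every occurrence
            } else {
                vector[index] = 1; // only record presence
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> vocabulary = Arrays.asList("cat", "dog", "the");
        String document = "the dog chased the cat";
        System.out.println(Arrays.toString(toVector(vocabulary, document, false))); // [1, 1, 1]
        System.out.println(Arrays.toString(toVector(vocabulary, document, true)));  // [1, 1, 2]
    }
}
```

Only the attribute for "the" differs between the two outputs, because it is the only word that occurs more than once.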
On Tue, Nov 12, 2019 at 8:26 AM BrendaAlexsandra <[hidden email]> wrote:
To classify my text, I can use these two techniques, but I don't know the
difference between them (for me it's the same thing) and I can't find in the
documentation what these attributes represent.