
NaiveBayesMultinomialText Normalization bug


Tom Misage
I think I found a bug in the NaiveBayesMultinomialText classifier that shows up when normalization is enabled and distributionForInstance is called.
I'm using Weka 3.8 on OS X 10.11.6 with Java 1.8.0_20.

The problem is in distributionForInstance, the method that calculates class membership probabilities for a test instance:

CODE
@Override
public double[] distributionForInstance(Instance instance) throws Exception {
  ...
  int allWords = 0; // bug - needs to be declared as double
  // for document normalization (if in use)
  double iNorm = 0;
  double fv = 0;
  ...
  if (ok) {
    double freq = (m_wordFrequencies) ? feature.getValue().m_count : 1.0;
    // double freq = (feature.getValue().m_count / iNorm * m_norm);
    if (m_normalize) {
      freq /= iNorm * m_norm;
    }
    allWords += freq; // fractional part is lost here while allWords is an int
  ...
ENDCODE

Basically, the variable "allWords", which accumulates the word frequencies in a document, is declared as an int, while the frequency variable "freq" is a double. This only matters when document normalization is on: the frequencies are then fractional, and each "allWords += freq" truncates the running total back to an integer, so that information is lost before the final calculation that determines the probabilities. Declaring allWords as a double fixes the problem.
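
To make the truncation concrete, here is a minimal stand-alone sketch. It is not Weka code: the class name, method names and the frequency values are all made up for illustration. It only relies on the fact that Java's compound assignment "allWords += freq" with an int on the left behaves like "allWords = (int) (allWords + freq)".

```java
// Hypothetical stand-alone demo; none of these names come from Weka.
class AllWordsTruncationDemo {

    // Mirrors the buggy declaration: the accumulator is an int.
    static int sumWithIntAccumulator(double[] freqs) {
        int allWords = 0;
        for (double freq : freqs) {
            // Compiles without a cast, but is equivalent to
            // allWords = (int) (allWords + freq), so the fractional
            // part is dropped on every iteration.
            allWords += freq;
        }
        return allWords;
    }

    // Mirrors the proposed fix: the accumulator is a double.
    static double sumWithDoubleAccumulator(double[] freqs) {
        double allWords = 0;
        for (double freq : freqs) {
            allWords += freq;
        }
        return allWords;
    }

    public static void main(String[] args) {
        // Made-up normalized frequencies, each below 1.
        double[] freqs = {0.25, 0.5, 0.25};
        System.out.println(sumWithIntAccumulator(freqs));    // prints 0
        System.out.println(sumWithDoubleAccumulator(freqs)); // prints 1.0
    }
}
```

With every normalized frequency below 1, the int accumulator never gets past 0, which is why the problem only shows up once normalization makes the frequencies fractional.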

I haven’t looked at the code for Weka 3.9, but it gives the same incorrect output.



_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
List info and subscription status: https://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: NaiveBayesMultinomialText Normalization bug

Mark Hall
Good catch! I've just committed a fix to stable-3-8 and trunk Weka. You can get the fix in the next nightly snapshot.

Thanks for the bug report!

Cheers,
Mark.

On 8/02/17, 11:23 AM, "Tom Misage" <[hidden email] on behalf of [hidden email]> wrote:

    I think I found a bug with the NaiveBayesMultinomialText classifier when using normalization and the distributionForInstance method.

