Missing values in different classifiers


Fernando Bugni
Hello!

I would like to know how the following classifiers handle missing values by default. I searched and found these characteristics:

ZeroRule: doesn't use missing values.

RIPPER (JRIP): It omits them because it doesn't use them to build the decision tree (I'm not sure).

C4.5 (J48): It omits them because missing values have no information gain.

SVM (Function SMO): By default, missing values are set to 0.

NaiveBayes (Identical implementation): Missing values are omitted because the Bayes formula doesn't take them into consideration.

I used all of them with their default settings, so I would like to know what they do with missing values. Are these statements correct? Could anyone help me?

Thanks in advance!
Bye!

--
Fernando Bugni

_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
List info and subscription status: http://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: Missing values in different classifiers

Eibe Frank-2
Administrator

> On 20 Feb 2015, at 01:00, Fernando Bugni <[hidden email]> wrote:
>
> Hello!
>
> I would like to know how the following classifiers handle missing values by default. I searched and found these characteristics:
>
> ZeroRule: doesn't use missing values.

Correct, it completely ignores all predictor attributes.

> RIPPER (JRIP): It omits them because it doesn't use them to build the decision tree (I'm not sure).

No. Take a look at the original RIPPER paper. Missing values *fail* any test included in the rules.

> C4.5 (J48): It omits them because missing values have no information gain.

No. Take a look at the C4.5 book. C4.5 adjusts the info gain based on the proportion of missing values. Also, instances with missing values for a test at a node get split into "fractional" instances before they are passed further down from that node.
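
The two mechanisms described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not J48's actual code: the function names (entropy, adjusted_gain, split_fractionally) and the row layout (a dict with "label", "weight" and attribute keys, with None marking a missing value) are hypothetical, and only the gain is scaled here.

```python
import math
from collections import defaultdict


def entropy(pairs):
    """Entropy in bits of a list of (label, weight) pairs."""
    total = sum(w for _, w in pairs)
    per_class = defaultdict(float)
    for label, w in pairs:
        per_class[label] += w
    return -sum(c / total * math.log2(c / total)
                for c in per_class.values() if c > 0)


def adjusted_gain(rows, attr):
    """Info gain computed on rows with a known value for `attr`,
    scaled by the weighted fraction of rows where `attr` is known."""
    known = [r for r in rows if r[attr] is not None]
    if not known:
        return 0.0
    total_w = sum(r["weight"] for r in rows)
    known_w = sum(r["weight"] for r in known)
    base = entropy([(r["label"], r["weight"]) for r in known])
    by_value = defaultdict(list)
    for r in known:
        by_value[r[attr]].append((r["label"], r["weight"]))
    remainder = sum(sum(w for _, w in part) / known_w * entropy(part)
                    for part in by_value.values())
    return (known_w / total_w) * (base - remainder)


def split_fractionally(rows, attr):
    """Send rows down the branch of their value; a row with a missing
    value goes down every branch with a proportionally reduced weight."""
    branch_w = defaultdict(float)
    for r in rows:
        if r[attr] is not None:
            branch_w[r[attr]] += r["weight"]
    known_w = sum(branch_w.values())
    branches = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            branches[r[attr]].append(dict(r))
        else:
            for value, w in branch_w.items():
                part = dict(r)
                part["weight"] = r["weight"] * w / known_w
                branches[value].append(part)
    return dict(branches)
```

With three known values out of four instances, the gain is multiplied by 0.75, and the instance with the missing value contributes a weight of 2/3 to one branch and 1/3 to the other if the known values split 2 to 1.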

> SVM (Function SMO): By default, missing values are set to 0.

No. SMO in WEKA applies WEKA's ReplaceMissingValues filter.
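
For readers unfamiliar with that filter: ReplaceMissingValues fills in numeric attributes with the mean and nominal attributes with the mode, both taken from the training data. A minimal Python sketch of that idea follows. It is not WEKA's code; fit_replacements and apply_replacements are hypothetical names, and None stands in for a missing value.

```python
from collections import Counter


def fit_replacements(rows, numeric, nominal):
    """Learn one fill-in value per attribute from the training rows:
    the mean for numeric attributes, the mode for nominal ones."""
    fill = {}
    for attr in numeric:
        vals = [r[attr] for r in rows if r[attr] is not None]
        fill[attr] = sum(vals) / len(vals) if vals else 0.0
    for attr in nominal:
        vals = [r[attr] for r in rows if r[attr] is not None]
        fill[attr] = Counter(vals).most_common(1)[0][0] if vals else None
    return fill


def apply_replacements(rows, fill):
    """Return copies of the rows with missing values replaced.
    The same learned values are used for training and test rows."""
    return [{attr: (fill[attr] if value is None and attr in fill else value)
             for attr, value in r.items()}
            for r in rows]
```

The point of separating the two steps is that a test instance is completed with statistics learned from the training set, not from the test set.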

> NaiveBayes (Identical implementation): Missing values are omitted because the Bayes formula doesn't take them into consideration.

Yes, missing values are simply skipped. This is possible because attributes are assumed to be conditionally independent.
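
To make "simply skipped" concrete, here is a small Python sketch of a nominal-only Naive Bayes that leaves a missing value out of both the counts and the product of likelihoods. It is an illustration, not WEKA's NaiveBayes: the names train_nb and predict_nb, the Laplace smoothing, and the use of None for a missing value are all assumptions of this sketch.

```python
import math
from collections import defaultdict


def train_nb(rows, attrs):
    """Count class and attribute-value frequencies, ignoring missing values."""
    class_counts = defaultdict(int)
    value_counts = defaultdict(int)   # (attr, value, label) -> count
    known_counts = defaultdict(int)   # (attr, label) -> count of known values
    values = defaultdict(set)         # attr -> distinct observed values
    for r in rows:
        class_counts[r["label"]] += 1
        for attr in attrs:
            if r[attr] is None:
                continue              # a missing value adds to no count
            value_counts[(attr, r[attr], r["label"])] += 1
            known_counts[(attr, r["label"])] += 1
            values[attr].add(r[attr])
    return class_counts, value_counts, known_counts, values


def predict_nb(model, instance, attrs):
    """Return class probabilities; attributes with a missing value are skipped."""
    class_counts, value_counts, known_counts, values = model
    n = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        log_p = math.log(count / n)
        for attr in attrs:
            value = instance.get(attr)
            if value is None:
                continue              # factor left out of the product
            k = len(values[attr]) or 1
            log_p += math.log((value_counts[(attr, value, label)] + 1)
                              / (known_counts[(attr, label)] + k))
        scores[label] = log_p
    top = max(scores.values())
    unnorm = {label: math.exp(s - top) for label, s in scores.items()}
    z = sum(unnorm.values())
    return {label: u / z for label, u in unnorm.items()}
```

Because each attribute contributes an independent factor, dropping a factor is equivalent to predicting with a model that never had that attribute; an instance with every attribute missing gets the class prior.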

Cheers,
Eibe

Re: Missing values in different classifiers

Fernando Bugni
Thanks again for replying!

But I still have doubts about JRIP, C4.5 and Function SMO. I'll explain each one.

JRIP: I checked the paper and you are right, but what happens if you train JRIP with default settings on a dataset with missing values? We get rules as a result, so what values do the missing values take?

Function SMO: when you run this classifier with default settings, there is the parameter 'checks on', which assumes that a missing value has a weight equal to 0. I read this here: http://weka.sourceforge.net/doc.dev/weka/classifiers/functions/SMO.html in the list of valid options. Why did you say that it uses the ReplaceMissingValues filter? I just want to know what value the missing values take by default, and (because of the link) I think it is 0. Perhaps it uses the filter but changes them to 0.

C4.5: let me explain what I understood, to check whether I'm right. When you run C4.5, it builds a decision tree. The issue arises (at test time) when an instance x is being classified and the attribute tested at a node is missing in x. The instance is then propagated down the branches with weights proportional to the frequencies of the observed non-missing values. When both paths reach the leaves, it combines both results, returns the one with more weight and discards the other.

I really appreciate your time and your explanations! Sorry for the disturbance.
Thanks in advance!
Bye!

--
Fernando Bugni


Re: Missing values in different classifiers

Eibe Frank-2
Administrator

> On 21 Feb 2015, at 00:54, Fernando Bugni <[hidden email]> wrote:
>
> JRIP: I checked the paper and you are right, but what happens if you train JRIP with default settings on a dataset with missing values? We get rules as a result, so what values do the missing values take?

Missing values simply don't match any other values when the rules are evaluated. And missing values do not occur in the rules.
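
A short Python sketch of that matching behaviour, assuming a rule is an ordered list of (attribute, operator, value) conditions and None marks a missing value. The names covers and classify and the rule encoding are hypothetical and are not taken from JRip.

```python
import operator

# Comparison operators allowed in a rule condition (an assumption of this sketch).
OPS = {"==": operator.eq, "<=": operator.le, ">=": operator.ge}


def covers(rule, instance):
    """True only if every condition of the rule holds for the instance."""
    for attr, op, value in rule:
        actual = instance.get(attr)
        if actual is None:
            return False          # a missing value fails every test
        if not OPS[op](actual, value):
            return False
    return True


def classify(rules, default_label, instance):
    """Return the label of the first rule that covers the instance,
    falling back to the default label when no rule fires."""
    for rule, label in rules:
        if covers(rule, instance):
            return label
    return default_label
```

An instance whose tested attribute is missing is therefore never covered by that rule and falls through to later rules or to the default class.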

> Function SMO: when you run this classifier with default settings, there is the parameter 'checks on', which assumes that a missing value has a weight equal to 0. I read this here: http://weka.sourceforge.net/doc.dev/weka/classifiers/functions/SMO.html in the list of valid options. Why did you say that it uses the ReplaceMissingValues filter? I just want to know what value the missing values take by default, and (because of the link) I think it is 0. Perhaps it uses the filter but changes them to 0.

The "no-checks" option just allows you to turn off replacement of missing values with ReplaceMissingValues. This normally only makes sense if you know that your data does not have missing values and you want to reduce runtime.

> C4.5: let me explain what I understood, to check whether I'm right. When you run C4.5, it builds a decision tree. The issue arises (at test time) when an instance x is being classified and the attribute tested at a node is missing in x. The instance is then propagated down the branches with weights proportional to the frequencies of the observed non-missing values.

Yes.

> When both paths reach the leaves, it combines both results, returns the one with more weight and discards the other.

It doesn't discard anything. It merges the class probability distributions returned from the leaves by computing a weighted arithmetic average.
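
The merging step can be illustrated with a small recursive function in Python. This is a sketch under assumptions, not J48's implementation: the tree is a nested dict, each internal node stores hypothetical per-branch "weights" (the fraction of training weight with a known value that went down each branch), leaves store a class distribution under "dist", and None marks a missing value.

```python
def predict(node, instance):
    """Return a class probability distribution for the instance."""
    if "dist" in node:                      # leaf: return its distribution
        return node["dist"]
    value = instance.get(node["attr"])
    if value is not None:                   # known value: follow one branch
        return predict(node["children"][value], instance)
    # Missing value: visit every branch and average the returned
    # distributions, weighted by the branch's training fraction.
    merged = {}
    for branch, child in node["children"].items():
        weight = node["weights"][branch]
        for label, p in predict(child, instance).items():
            merged[label] = merged.get(label, 0.0) + weight * p
    return merged
```

With branch weights 0.6 and 0.4 and leaf distributions (0.9, 0.1) and (0.2, 0.8), a missing value yields 0.6 * 0.9 + 0.4 * 0.2 = 0.62 for the first class and 0.38 for the second; both leaves contribute to the result.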

Cheers,
Eibe

Re: Missing values in different classifiers

Fernando Bugni
Thank you for the answers!! They are really helpful! I really appreciate them.

These characteristics describe how each algorithm handles missing values at test time. I would now like to know how these classifiers handle missing values by default during training. I searched and found the following for training:

ZeroRule: doesn't use missing values in the training phase.

RIPPER (JRIP): For training, I didn't find anything. I think that it discards them when it runs the greedy algorithm, but I'm not sure.

C4.5 (J48): I found that "if node n tests A assign most common value of A among other examples sorted to node n". I couldn't work out at which point in building the decision tree this is done.

SVM (Function SMO): It applies the ReplaceMissingValues filter before training.

NaiveBayes (Identical implementation): By default it uses the dataset with missing values in training, and then in the test phase it just ignores them.

So are these statements correct for the training phase?

Thanks in advance!
Goodbye!

--
Fernando Bugni
