missing values in continuous attributes

Leonardo Sewald Cunha
Hello all:

I'm implementing a C4.5-based decision tree classifier and I'm having
trouble handling missing values in continuous attributes during the
learning (tree-building) stage. They can't be handled like discrete
attributes, right? For discrete attributes, I'm replacing missing values
with a global constant (such as "?") that denotes a missing value.
For continuous attributes, are there any options other than replacing
the missing value with something, such as the mean? I suspect that this
method could hurt the accuracy of the generated tree model.
Any help would be greatly appreciated.

thanks in advance,

--
Leonardo Sewald Cunha




_______________________________________________
Wekalist mailing list
[hidden email]
https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist

RE: missing values in continuous attributes

subrat
Hi,

When using data with missing values, you are right that substituting a global placeholder such as '?' (or some other marker) can introduce errors into the learned tree. For continuous-valued attributes, as opposed to discrete ones, many algorithms do not offer a complete solution, so people sometimes resort to 'discretising' the data before using it.
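As a rough illustration of the discretising idea, here is a minimal Python sketch that bins a continuous attribute into equal-width intervals and sends missing entries to their own "?" category, so they can then be treated like a discrete attribute. The function name `discretise`, the equal-width scheme, and the bin labels are assumptions made for this example; they are not taken from C4.5 or Weka.

```python
import math

def discretise(values, n_bins=3, missing_label="?"):
    """Equal-width binning; None/NaN entries get their own category."""
    def missing(v):
        return v is None or math.isnan(v)

    known = [v for v in values if not missing(v)]
    if not known:
        # Nothing to bin: every entry is missing.
        return [missing_label] * len(values)

    lo, hi = min(known), max(known)
    # Fall back to a width of 1.0 when all known values are identical.
    width = (hi - lo) / n_bins or 1.0

    labels = []
    for v in values:
        if missing(v):
            labels.append(missing_label)
        else:
            # Clamp so that the maximum value lands in the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
            labels.append(f"bin{idx}")
    return labels

print(discretise([1.0, None, 5.0, 9.0, float("nan")], n_bins=2))
```

Equal-width bins are only one possible choice, and the number of bins changes which splits the tree can find, so it is worth trying a few settings.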

These are the methods you may consider trying for continuous attributes with missing values:

1. As you suggested, replace the missing values of an attribute with that attribute's mean, or with its mode. This biases the distribution towards the mean or towards the most frequent value, respectively, but it lets you keep the information in the case's other attributes, which you would lose if you deleted every case that has a missing attribute.

2. A second, more robust option is multiple imputation. Instead of substituting a single mean, you estimate the missing values from the available data (for example, by maximum likelihood), producing several plausible completed datasets. I am not sure whether this is available in Weka, but you could implement it yourself and then import the completed data into Weka for classification.

3. In some domains, such as financial data or machine data, a missing value is simply reported as missing, because the fact that it is missing can carry semantic meaning in that domain. The techniques in 1 and 2 above should be used only when you are sure that your data is missing at random.
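The three options above can be sketched in a few lines of Python. This is a toy, single-attribute illustration under stated assumptions, not a definitive implementation: the helper names (`is_missing`, `mean_impute`, `multiple_impute`) are made up for this example, and the multiple-imputation part simply draws from a normal distribution fitted to the known values of the same attribute. Real multiple imputation would normally also condition on the other attributes and pool the results over the completed datasets.

```python
import math
import random
import statistics

def is_missing(v):
    """Treat None and NaN as missing."""
    return v is None or (isinstance(v, float) and math.isnan(v))

def mean_impute(values):
    """Option 1 plus option 3: fill with the mean of the known values and
    return a 0/1 flag column so 'was missing' stays visible to the learner."""
    known = [v for v in values if not is_missing(v)]
    mean = statistics.fmean(known)  # raises StatisticsError if nothing is known
    filled = [mean if is_missing(v) else v for v in values]
    flags = [1 if is_missing(v) else 0 for v in values]
    return filled, flags

def multiple_impute(values, m=5, seed=0):
    """Option 2, heavily simplified: build m completed copies, drawing each
    missing entry from a normal distribution fitted to the known values."""
    rng = random.Random(seed)
    known = [v for v in values if not is_missing(v)]
    mu = statistics.fmean(known)
    sigma = statistics.pstdev(known)  # population std dev = ML estimate
    return [
        [rng.gauss(mu, sigma) if is_missing(v) else v for v in values]
        for _ in range(m)
    ]

filled, flags = mean_impute([2.0, None, 4.0])
print(filled, flags)
for completed in multiple_impute([2.0, None, 4.0], m=3):
    print(completed)
```

With the flag column kept as an extra attribute, the tree can still split on "value was missing" when that carries meaning in the domain, while the filled column keeps the case usable for ordinary threshold splits.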

I hope this helps,

Regards,
Subrat Nanda




-----Original Message-----
From: [hidden email]
[mailto:[hidden email]]On Behalf Of Leonardo
Sewald Cunha
Sent: Monday, June 27, 2005 8:04 PM
To: [hidden email]
Subject: [Wekalist] missing values in continuous attributes





