

Dear all,

I am using C4.5 (a decision tree algorithm) to classify my data, and the result was as follows:

<http://weka.8497.n7.nabble.com/file/t6588/DTree.png>

The root was calculated correctly using the information gain ratio, and the values <= 56.91875 are also correct. The problem I face is that the attribute at the top of the right subtree (attr682) is not the one with the next-highest information gain. Can anyone explain why, and how Weka computes the subtree? Note that my attributes are numeric, with a nominal binary class (positive and negative).

Also, when I checked attr682, the number of positives less than 107.4425 is NOT correct.

I would appreciate help clarifying these points.

Thank you,
Abdrahman

Sent from: http://weka.8497.n7.nabble.com/

_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
To subscribe, unsubscribe, etc., visit https://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html


Before you calculated the quality of the split, did you first reduce the data to the subset of instances for which attribute1671 > 56.91…?
To see exactly how a split is chosen, take a look at the code in C45Split.java and C45ModelSelection.java. The C4.5 split selection process for numeric attributes actually employs several heuristics on top of information gain/gain ratio:

- Only split points that carve off at least a certain minimum amount of data are considered for splitting at all (see C45Split.java).
- From the remaining split points for the current attribute, the one with the best information gain is chosen (see C45Split.java).
- The information gain for this split point is adjusted using a heuristic motivated by the minimum description length principle (see C45Split.java).
- The resulting attribute-cum-split-point is only admissible if it has greater-than-average information gain (see C45ModelSelection.java).
- Of all the attribute-cum-split-points, one per attribute, the one with the best gain ratio is chosen for splitting (see C45ModelSelection.java).
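A rough sketch of the per-attribute part of these heuristics may help. The Python below is a simplified illustration, not Weka's actual code: the function names are mine, thresholds are taken as midpoints between adjacent distinct values, and the MDL correction is assumed to be log2(number of candidate split points) divided by the number of instances; check C45Split.java for the exact details.

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_numeric_split(values, labels, min_per_branch=2):
    """Choose a threshold for one numeric attribute, roughly following
    the heuristics above. Returns (threshold, corrected_gain), or None
    if no admissible split point exists."""
    n = len(values)
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best = None
    n_candidates = 0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold between equal attribute values
        if i < min_per_branch or n - i < min_per_branch:
            continue  # split must carve off a minimum amount of data
        n_candidates += 1
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        gain = (base - len(left) / n * entropy(left)
                     - len(right) / n * entropy(right))
        if best is None or gain > best[1]:
            best = (threshold, gain)  # keep the best information gain
    if best is None:
        return None
    # MDL-motivated correction: penalise the gain by log2(#candidates)/N
    return (best[0], best[1] - math.log2(n_candidates) / n)
```

Across attributes, the corrected gains would then be compared against the average and the final winner picked by gain ratio, as described above.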
Cheers,
Eibe
> On 26/11/2018, at 2:21 AM, Abdrahman0x < [hidden email]> wrote:


Let's say your data looks like this, with two predictor attributes a and b:

a,b,class
4,1,yes
2,5,no
1,7,yes
6,2,no

Let's say the split at the root node of the decision tree is a >= 3. Then, for the left and right successor nodes respectively, you will have to consider the two subsets of data

1,7,yes
2,5,no

and

4,1,yes
6,2,no

For both subsets of data, you then have to compute the split with maximum information gain (in this example, the two possible splits for each subset, one on attribute a and one on attribute b, are tied for maximum information gain).
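The tie can be checked with a few lines of Python; this is just an illustration with helper names and thresholds of my own choosing, not Weka code:

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(rows, attr_index, threshold):
    """Information gain of splitting rows (tuples ending in the class
    label) on rows[attr_index] < threshold vs. >= threshold."""
    n = len(rows)
    labels = [r[-1] for r in rows]
    left = [r[-1] for r in rows if r[attr_index] < threshold]
    right = [r[-1] for r in rows if r[attr_index] >= threshold]
    return (entropy(labels) - len(left) / n * entropy(left)
                            - len(right) / n * entropy(right))

# the subset reaching the left successor node after the root split a >= 3
left_subset = [(1, 7, 'yes'), (2, 5, 'no')]

# splitting on a (e.g. threshold 1.5) or on b (e.g. threshold 6) both
# separate the two classes perfectly, so the gains are tied at 1 bit
gain_a = info_gain(left_subset, 0, 1.5)
gain_b = info_gain(left_subset, 1, 6)
```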
Cheers,
Eibe


Thank you Eibe,

Your elaboration is clear, but I want to double-check some points if possible:

First, I calculate the information gain for each attribute; the root will then be the one with the maximum or highest information gain (in my case attr1671).

Second, from the generated subsets I take the attribute with the second-highest information gain (in my case attr682) and do the splitting.

Third, I recursively apply the same procedure to the remaining rows of data.

One concern: how do we know that we must stop at attr201? I can understand that there must be a stopping criterion (I tried to understand it but couldn't), and I also found that this attribute classified all of the remaining instances. Note that the splitting point of this attribute was selected differently from the others.

I hope my points are clear.

Thank you


