C4.5 Algorithm Output Issue

Abdrahman0x
Dear all,

I am using C4.5 (a decision tree algorithm) to classify my data, and the
result was as follows:

<http://weka.8497.n7.nabble.com/file/t6588/DTree.png>

The root was calculated correctly using the information gain ratio, and
the values <= 56.91875 are also correct. The problem I am facing is that
the attribute that appears in the right sub-tree (attr 682) is not the
one with the next-highest information gain value. Can anyone explain why,
and how Weka computes the sub-tree? Note that my attributes are numeric
and the class is nominal and binary (positive and negative).

Also, when I checked attr 682, the number of positives with values less
than 107.4425 is NOT correct.

I would appreciate any help clarifying these points.

Thank you
Abdrahman



--
Sent from: http://weka.8497.n7.nabble.com/
_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
To subscribe, unsubscribe, etc., visit https://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: C4.5 Algorithm Output Issue

Eibe Frank-2
Administrator
Before you calculated the quality of the split, did you first reduce the data to the subset of instances for which attribute1671 > 56.91…?
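
To illustrate why the subset matters, here is a minimal Python sketch (not Weka code). The toy data, the attribute names "root" and "other", and the thresholds are all hypothetical. It computes plain information gain for the same candidate split twice: once on the full data and once on only the instances that reach the right branch of the root.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, threshold):
    """Information gain of the binary split attr <= threshold on rows.

    rows is a list of (features_dict, label) pairs.
    """
    labels = [y for _, y in rows]
    left = [y for x, y in rows if x[attr] <= threshold]
    right = [y for x, y in rows if x[attr] > threshold]
    n = len(rows)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

# Hypothetical toy data: "root" plays the role of the root attribute.
rows = [
    ({"root": 10.0, "other": 1.0}, "neg"),
    ({"root": 20.0, "other": 5.0}, "neg"),
    ({"root": 30.0, "other": 1.0}, "neg"),
    ({"root": 40.0, "other": 5.0}, "neg"),
    ({"root": 60.0, "other": 1.0}, "pos"),
    ({"root": 70.0, "other": 1.0}, "pos"),
    ({"root": 80.0, "other": 5.0}, "neg"),
    ({"root": 90.0, "other": 5.0}, "neg"),
]

# Only the instances that go down the right branch of the root split.
right_branch = [(x, y) for x, y in rows if x["root"] > 50.0]

gain_full = info_gain(rows, "other", 3.0)            # computed on all data
gain_subset = info_gain(right_branch, "other", 3.0)  # computed on the branch
print(gain_full, gain_subset)
```

The two values differ, so a ranking of attributes computed on the whole dataset says little about which attribute wins inside a sub-tree. The class counts reported at a node are likewise counts over the instances that reached that node, not over the whole dataset.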

To see exactly how a split is chosen, take a look at the code in C45Split.java and C45ModelSelection.java. The C45 split selection process for numeric attributes actually employs several heuristics on top of info gain/gain ratio:

- Only split points that carve off at least a certain minimum amount of data are considered for splitting at all (see C45Split.java).

- From the remaining split points for the current attribute, the one with the best information gain is chosen (see C45Split.java).

- The information gain for this split point is adjusted using a heuristic motivated by the minimum description length principle (see C45Split.java).

- The resulting attribute-cum-split-point is only admissible if it has greater than average information gain (see C45ModelSelection.java).

- Of all the attribute-cum-split-points, one for each attribute, the one with the best gain ratio is chosen for splitting (see C45ModelSelection.java).
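
The five steps above can be sketched roughly as follows. This is a simplified Python illustration, not a transcription of C45Split.java or C45ModelSelection.java. The function names, the `min_leaf` parameter, the exact form of the MDL-style correction (log2 of the number of candidate split points, divided by the number of instances), and the small comparison tolerance are all assumptions made for the example; the Java sources are the reference for the real constants and tie-breaking.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split_for_attribute(values, labels, min_leaf=2):
    """Return (gain_ratio, adjusted_gain, threshold) for one numeric
    attribute, or None if no useful split exists."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    ys = [y for _, y in pairs]
    base = entropy(ys)
    best_gain, best_i, candidates = -1.0, None, 0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        if i < min_leaf or n - i < min_leaf:
            continue  # step 1: each side must hold a minimum amount of data
        candidates += 1
        gain = base - (i / n) * entropy(ys[:i]) - ((n - i) / n) * entropy(ys[i:])
        if gain > best_gain:  # step 2: keep the split with the best info gain
            best_gain, best_i = gain, i
    if best_i is None:
        return None
    # Step 3: MDL-motivated penalty (assumed form, see lead-in).
    adjusted = best_gain - math.log2(candidates) / n
    if adjusted <= 0:
        return None
    # Gain ratio = adjusted gain / entropy of the branch sizes.
    split_info = entropy(["L"] * best_i + ["R"] * (n - best_i))
    threshold = pairs[best_i - 1][0]  # instances <= threshold go left
    return adjusted / split_info, adjusted, threshold

def select_split(columns, labels, min_leaf=2):
    """columns maps attribute name -> list of numeric values.
    Returns (attribute, threshold) or None."""
    results = {}
    for name, values in columns.items():
        r = best_split_for_attribute(values, labels, min_leaf)
        if r is not None:
            results[name] = r
    if not results:
        return None
    # Step 4: keep only candidates with at least average info gain
    # (the tolerance is an assumption so a lone candidate is not rejected).
    avg_gain = sum(r[1] for r in results.values()) / len(results)
    admissible = {a: r for a, r in results.items() if r[1] >= avg_gain - 1e-9}
    # Step 5: among the admissible candidates, pick the best gain ratio.
    best = max(admissible, key=lambda a: admissible[a][0])
    return best, admissible[best][2]

# Hypothetical toy data: "a" separates the classes cleanly, "b" does not.
labels = ["neg"] * 4 + ["pos"] * 4
columns = {"a": [1, 2, 3, 4, 5, 6, 7, 8], "b": [1, 5, 2, 6, 3, 7, 4, 8]}
print(select_split(columns, labels))
```

Because of steps 1, 3 and 4, the attribute chosen at a node is not necessarily the one with the highest raw information gain, or even the highest raw gain ratio, which is one reason a hand calculation can disagree with the tree.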

Cheers,
Eibe

> On 26/11/2018, at 2:21 AM, Abdrahman0x <[hidden email]> wrote:
> [...]
