C4.5 Algorithm Output Issue


Abdrahman0x
Dear all,

I am using C4.5 (a decision tree algorithm) to classify my data, and the
result was as follows:

<http://weka.8497.n7.nabble.com/file/t6588/DTree.png>

The root was calculated correctly using the information gain ratio, and
the values <= 56.91875 are also correct. The problem I face is that
the attribute on the right sub-tree (attr 682) is not the one with the
next-highest information gain value. Can anyone explain why, and how
Weka computes the sub-tree? Note that my attributes are numeric, with a
nominal binary class (positive and negative).

Also, when I checked attr 682, the number of positives less than
107.4425 is NOT correct.

I would appreciate the help to clarify these points.

Thank you
Abdrahman



--
Sent from: http://weka.8497.n7.nabble.com/
_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
To subscribe, unsubscribe, etc., visit https://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: C4.5 Algorithm Output Issue

Eibe Frank-2
Administrator
Before you calculated the quality of the split, did you reduce the subset of data to those instances for which attribute1671 > 56.91… first?

To see exactly how a split is chosen, take a look at the code in C45Split.java and C45ModelSelection.java. The C45 split selection process for numeric attributes actually employs several heuristics on top of info gain/gain ratio:

- Only split points that carve off at least a certain minimum amount of data are considered for splitting at all (see C45Split.java).

- From the remaining split points for the current attribute, the one with the best information gain is chosen (see C45Split.java).

- The information gain for this split point is adjusted using a heuristic motivated by the minimum description length principle (see C45Split.java).

- The resulting attribute-cum-split-point is only admissible if it has greater than average information gain (see C45ModelSelection.java).

- Of all the attribute-cum-split-points, one for each attribute, the one with the best gain ratio is chosen for splitting (see C45ModelSelection.java).
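
The heuristics listed above can be sketched in a few lines of Python. This is a simplified illustration, not the code in C45Split.java or C45ModelSelection.java: the function names, the min_leaf parameter, the midpoint thresholds, the log2(candidates)/n penalty, and the small tolerance on the average are all assumptions made for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def best_numeric_split(values, labels, min_leaf=2):
    """Best threshold for one numeric attribute, or None if nothing is usable.

    min_leaf and the MDL-style penalty are assumptions for this sketch.
    """
    n = len(values)
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best, candidates = None, 0
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold fits between two equal values
        if i < min_leaf or n - i < min_leaf:
            continue  # each side must carve off at least min_leaf instances
        candidates += 1
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        gain = base - (i / n) * entropy(left) - ((n - i) / n) * entropy(right)
        if best is None or gain > best["gain"]:
            # split information: entropy of the left/right partition sizes
            split_info = entropy(["L"] * i + ["R"] * (n - i))
            best = {"threshold": (lo + hi) / 2, "gain": gain,
                    "split_info": split_info}
    if best is None:
        return None
    # MDL-motivated adjustment (assumed form): charge for choosing a threshold
    best["gain"] -= math.log2(candidates) / n
    if best["gain"] <= 0:
        return None
    best["gain_ratio"] = best["gain"] / best["split_info"]
    return best


def select_split(columns, labels, min_leaf=2):
    """Pick one attribute; columns maps attribute name -> list of values."""
    found = {}
    for name, values in columns.items():
        split = best_numeric_split(values, labels, min_leaf)
        if split is not None:
            found[name] = split
    if not found:
        return None
    avg_gain = sum(s["gain"] for s in found.values()) / len(found)
    # keep splits with at least average gain (tiny tolerance is an assumption)
    admissible = {k: s for k, s in found.items() if s["gain"] >= avg_gain - 1e-9}
    name = max(admissible, key=lambda k: admissible[k]["gain_ratio"])
    return name, admissible[name]
```

Note that on very small subsets the penalty can push every gain to zero or below, in which case the sketch returns None and no split is made.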

Cheers,
Eibe

> On 26/11/2018, at 2:21 AM, Abdrahman0x <[hidden email]> wrote:
> [...]

Re: C4.5 Algorithm Output Issue

Abdrahman0x
Thank you Eibe,

But how can I do this:


Eibe Frank-2 wrote
> Before you calculated the quality of the split, did you reduce the subset
> of data to those instances for which attribute1671 > 56.91… first?

How can I reduce the subset of data for the instances? Do you mean applying
feature selection beforehand, or something else? I didn't follow; could you
please elaborate?

Thank you




Re: C4.5 Algorithm Output Issue

Eibe Frank-3
Find those rows in the data for which attribute1671 has a value greater than 56.91 and then do your calculations based on those rows of data only.
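
A minimal Python sketch of that filtering step follows. The rows are made-up toy values (only the attribute names and the two thresholds come from this thread), and the helper names are hypothetical. It shows how the class counts for attribute682 can differ between the full data and the subset that reaches the right branch.

```python
from collections import Counter


def subset_for_branch(rows, attribute, threshold):
    """Rows that go down the '>' branch of the parent split."""
    return [r for r in rows if r[attribute] > threshold]


def class_counts_at_or_below(rows, attribute, threshold):
    """Class counts among rows whose attribute value is <= threshold."""
    return Counter(r["class"] for r in rows if r[attribute] <= threshold)


# Toy data, invented for illustration only.
rows = [
    {"attribute1671": 40.0, "attribute682": 90.0, "class": "positive"},
    {"attribute1671": 50.0, "attribute682": 100.0, "class": "positive"},
    {"attribute1671": 60.0, "attribute682": 95.0, "class": "positive"},
    {"attribute1671": 70.0, "attribute682": 120.0, "class": "negative"},
    {"attribute1671": 80.0, "attribute682": 101.0, "class": "negative"},
    {"attribute1671": 55.0, "attribute682": 130.0, "class": "negative"},
]

right = subset_for_branch(rows, "attribute1671", 56.91875)
whole_counts = class_counts_at_or_below(rows, "attribute682", 107.4425)
branch_counts = class_counts_at_or_below(right, "attribute682", 107.4425)

print("all rows:   ", dict(whole_counts))
print("right branch:", dict(branch_counts))
```

Only the second set of counts is relevant when checking the numbers shown at the attribute682 node.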

Cheers,
Eibe

On Mon, Dec 31, 2018 at 12:40 AM Abdrahman0x <[hidden email]> wrote:
[...]

Re: C4.5 Algorithm Output Issue

Abdrahman0x
Thank you Mr. Eibe,

Do you mean I should look only at the attribute682 values in the rows
where attribute1671 >= 56.91? I tried this, but it did not give me 4
positives; it gave me 6 positives to the left of 107.4425.

I even calculated the information gain for attribute682, and the number
of positives to the left still differs from 4.

Can you please help.

Thank you




Re: C4.5 Algorithm Output Issue

Eibe Frank-3
Let's say your data looks like this, with two predictor attributes a and b:

a,b,class
4,1,yes
2,5,no
1,7,yes
6,2,no

Let's say the split at the root node of the decision tree is a >= 3. Then, for the left and right successor nodes respectively, you will have to consider the two subsets of data

1,7,yes
2,5,no

and

4,1,yes
6,2,no

For both subsets of data, you then have to compute the split with maximum information gain (in this example, the two possible splits for each subset, one on attribute a and another one on attribute b, will be tied for maximum information gain).
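
The tie in this small example can be checked with a short Python sketch. It uses plain information gain only, with thresholds at midpoints between distinct values; the function names are chosen for the illustration and none of the extra C4.5 heuristics are applied.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def info_gain(rows, attr, threshold):
    """Gain of splitting rows into attr < threshold and attr >= threshold."""
    left = [r["class"] for r in rows if r[attr] < threshold]
    right = [r["class"] for r in rows if r[attr] >= threshold]
    if not left or not right:
        return 0.0
    n = len(rows)
    labels = [r["class"] for r in rows]
    return (entropy(labels)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))


def best_gain(rows, attr):
    """Highest gain over all midpoints between distinct values of attr."""
    values = sorted({r[attr] for r in rows})
    midpoints = [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]
    return max((info_gain(rows, attr, t) for t in midpoints), default=0.0)


data = [
    {"a": 4, "b": 1, "class": "yes"},
    {"a": 2, "b": 5, "class": "no"},
    {"a": 1, "b": 7, "class": "yes"},
    {"a": 6, "b": 2, "class": "no"},
]

# Root split from the example: a >= 3
low = [r for r in data if r["a"] < 3]
high = [r for r in data if r["a"] >= 3]

for name, subset in (("a < 3", low), ("a >= 3", high)):
    print(name, {attr: best_gain(subset, attr) for attr in ("a", "b")})
```

For both subsets the best gain is 1.0 on attribute a and 1.0 on attribute b, so the two candidate splits are tied.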

Cheers,
Eibe





On Mon, Dec 31, 2018 at 9:53 PM Abdrahman0x <[hidden email]> wrote:
[...]

Re: C4.5 Algorithm Output Issue

Abdrahman0x
Thank you Eibe,

Your elaboration is clear but I want to double check some points if
possible:

First, I calculate the information gain for each attribute; the root is
the one with the highest information gain (in my case attr1671).

Second, from the subsets generated I take the attribute with the second
highest information gain (in my case attr682) and do the splitting.

Third, I recursively apply the same to the remaining rows of data.

One concern: how do we know that we must stop at attr201? I understand
that there must be a stopping criterion (I tried to understand it but
couldn't), and I also found that this attribute classified all the
remaining instances. Note that the split point of this attribute
was selected differently from the others.

I hope my points are clear.

Thank you
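
For reference, the recursive procedure discussed in this thread can be sketched in Python as below. It is a simplified illustration under stated assumptions: it uses plain information gain rather than gain ratio, omits the additional heuristics Eibe listed, and stops only when a node is pure or no split has positive gain. All names are hypothetical. The point it shows is that the gains are recomputed on each subset, so the attribute chosen at a child node need not be the one ranked second on the full data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def best_split(rows, attrs):
    """(gain, attr, threshold) with the highest gain on these rows, or None."""
    labels = [r["class"] for r in rows]
    base, n = entropy(labels), len(rows)
    best = None
    for attr in attrs:
        values = sorted({r[attr] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r["class"] for r in rows if r[attr] <= t]
            right = [r["class"] for r in rows if r[attr] > t]
            gain = (base - len(left) / n * entropy(left)
                    - len(right) / n * entropy(right))
            if best is None or gain > best[0]:
                best = (gain, attr, t)
    return best


def build(rows, attrs):
    """Grow a tree; a leaf is a class label, an inner node is a dict."""
    labels = [r["class"] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:
        return majority  # stop: all remaining rows share one class
    split = best_split(rows, attrs)  # recomputed on THIS subset only
    if split is None or split[0] <= 0:
        return majority  # stop: no split improves purity
    _, attr, t = split
    left = [r for r in rows if r[attr] <= t]
    right = [r for r in rows if r[attr] > t]
    return {"attr": attr, "threshold": t,
            "le": build(left, attrs), "gt": build(right, attrs)}


def predict(tree, row):
    """Follow the branches until a leaf label is reached."""
    while isinstance(tree, dict):
        branch = "le" if row[tree["attr"]] <= tree["threshold"] else "gt"
        tree = tree[branch]
    return tree


data = [
    {"a": 4, "b": 1, "class": "yes"},
    {"a": 2, "b": 5, "class": "no"},
    {"a": 1, "b": 7, "class": "yes"},
    {"a": 6, "b": 2, "class": "no"},
]
tree = build(data, ["a", "b"])
print(tree)
```

In this sketch a branch ends as soon as its rows all belong to one class, which matches the observation that the last attribute separated all the remaining instances.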




