Question about InfoGainAttributeEval with Ranker


Fernando Bugni
Hello!

Could someone tell me the range of possible values for the information gain of an attribute? For example, if I have

Ranked attributes:
 0.07231     att_1
 0.07217     att_2
 0.03963     att_3
 0.03963     att_4

I know that att_1 has a higher information gain than the others, but I want to know what range these values can fall in. Is 0.07 high? Is it low? I can't tell.

I know that InfoGain(Class, Attribute) = H(Class) - H(Class | Attribute),
and that entropy satisfies 0 <= H <= log_a(n), where (I think) a = 2 and n = the number of samples.
But I don't know how to use this property to work out the range of the information gain. If I have a classification into two groups, how could I use it to calculate H(Class) and H(Class | Attribute)?

Could anyone help me?

Thanks in advance!
Bye!
-- 
Fernando Bugni


Re: Question about InfoGainAttributeEval with Ranker

Eibe Frank-2
Administrator

> On 11/01/2015, at 6:26 pm, Fernando Bugni <[hidden email]> wrote:
>
> I know that InfoGain(Class, Attribute) = H(Class) - H(Class | Attribute),
> and that entropy satisfies 0 <= H <= log_a(n), where (I think) a = 2 and n = the number of samples.
> But I don't know how to use this property to work out the range of the information gain. If I have a classification into two groups, how could I use it to calculate H(Class) and H(Class | Attribute)?

The minimum information gain is zero, which occurs when H(Class) = H(Class | Attribute), i.e., when the attribute tells you nothing about the class.

The maximum is H(Class) itself, achieved when H(Class | Attribute) = 0.

Entropy is maximal when all classes are equally likely, in which case it is log_b(c), where b = 2 (if entropy is calculated in bits) and c is the *NUMBER OF CLASS VALUES*.

In the two-class case, the maximum info gain is 1 bit (and occurs when both classes are equally likely a priori, before the attribute is considered).

However, in most datasets, not all classes are equally likely a priori, so H(Class) will be smaller than log_b(c).
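
To make this concrete for the two-class case, here is a small stand-alone sketch (illustrative only, not WEKA code) that computes H(Class) = -p*log_2(p) - (1-p)*log_2(1-p) for a few class priors p; it peaks at exactly 1 bit when p = 0.5:

 public class TwoClassEntropy {

   // Binary entropy in bits; x*log(x) is taken to be 0 at the boundary.
   static double h(double p) {
     if (p <= 0 || p >= 1) return 0;
     return -(p * Math.log(p) + (1 - p) * Math.log(1 - p)) / Math.log(2);
   }

   public static void main(String[] args) {
     for (double p : new double[]{0.1, 0.25, 0.5, 0.75, 0.9}) {
       // H(Class) is the ceiling on the info gain of any attribute
       System.out.printf("p = %.2f -> H(Class) = %.4f bits%n", p, h(p));
     }
   }
 }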

You can calculate H(Class) in WEKA, in the Classify panel, by running any classifier (e.g., ZeroR), setting “Use training set” for evaluation, and “Output entropy evaluation measures” under “More options…”.
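
If you prefer the command line, the -k option, which switches on the information-theoretic evaluation statistics, should print the same measures (this assumes weka.jar is on the classpath and iris.arff is in the current directory):

 java weka.classifiers.rules.ZeroR -t iris.arff -k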

For example, running ZeroR on the iris data gives:

=== Summary ===

Correctly Classified Instances          50               33.3333 %
Incorrectly Classified Instances       100               66.6667 %
Kappa statistic                          0    
K&B Relative Info Score                  0      %
K&B Information Score                    0      bits      0      bits/instance
Class complexity | order 0             237.7444 bits      1.585  bits/instance
Class complexity | scheme              237.7444 bits      1.585  bits/instance
Complexity improvement     (Sf)          0      bits      0      bits/instance
Mean absolute error                      0.4444
Root mean squared error                  0.4714
Relative absolute error                100      %
Root relative squared error            100      %
Coverage of cases (0.95 level)         100      %
Mean rel. region size (0.95 level)     100      %
Total Number of Instances              150    

"Class complexity | order 0" gives you H(Class). It is 1.585  bits/instance for the iris data (rounded) because, for this data, all three classes are equally likely, so H(Class)=log_2(3).

Cheers,
Eibe

Re: Question about InfoGainAttributeEval with Ranker

Tom
Hi,

I was reading this old conversation because I have the same question. I read Eibe's explanation, but I don't understand why the log base is 2 rather than e.

When I look at the WEKA source (3.8.4), ContingencyTables line 335 (which calculates the H() measures) clearly uses base e. But log_e(3) is 1.099, not 1.585.

My Ranker scores are 1.3793 and 1.2268 for the top two ranked attributes, so it would seem they are in base 2: with three class values, base-e scores could never exceed log_e(3) ≈ 1.099.

Is WEKA using base 2 or base e?

Cheers!

Re: Question about InfoGainAttributeEval with Ranker

Eibe Frank
Computer scientists like base 2 because that gives you a measure of information in bits.

That old thread was about the entropy-based statistics that WEKA's evaluation module computes when evaluating a classifier ("class complexity"). That is different from attribute evaluation with the information gain attribute evaluator.

Anyway, yes, base 2 is used in the corresponding method in ContingencyTables; see the return statement, which divides by log2 (the natural logarithm of 2):

 public static double entropyOverColumns(double[][] matrix){

   double returnValue = 0, sumForColumn, total = 0;

   // lnFunc(x) computes x * ln(x) using the natural logarithm, so the
   // sums below are accumulated in nats.
   for (int j = 0; j < matrix[0].length; j++){
     sumForColumn = 0;
     for (int i = 0; i < matrix.length; i++) {
       sumForColumn += matrix[i][j];
     }
     returnValue = returnValue - lnFunc(sumForColumn);
     total += sumForColumn;
   }
   if (Utils.eq(total, 0)) {
     return 0;
   }
   // Dividing by total turns the counts into probabilities, and dividing
   // by log2 (the constant ln 2) converts the result from nats to bits.
   return (returnValue + lnFunc(total)) / (total * log2);
 }
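
As a quick sanity check (a hypothetical snippet, assuming weka.jar is on the classpath), a contingency table whose two column sums are equal gives a uniform class distribution over two values, so the method should return exactly log_2(2) = 1 bit:

 import weka.core.ContingencyTables;

 public class EntropyCheck {
   public static void main(String[] args) {
     // Rows = attribute values, columns = class values.
     // Column sums are 50 and 50, i.e., a uniform two-class distribution.
     double[][] table = {
       {30.0, 20.0},
       {20.0, 30.0}
     };
     System.out.println(ContingencyTables.entropyOverColumns(table)); // 1.0
   }
 }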

Cheers,
Eibe


Re: Question about InfoGainAttributeEval with Ranker

Tom
I missed the division in the return statement, heh. I was too focused on the lnFunc. Thanks for the explanation!

Cheers,
  Tom
