student question to CfsSubsetEval and Random forest


student question to CfsSubsetEval and Random forest

Hao Li

Dear Weka data mining team,
 


I am a student and have a couple of short questions (very simple ones despite the long text below) regarding the feature selection algorithm CfsSubsetEval and the random forest algorithm in your software.

I somehow cannot register on your mailing list: it constantly asks me to verify that I am not a robot, but I don't see any verification options. I did find the following advice, so I am sending my question to you directly:
"Also, when sending a new message, it is best to use your email client to send your message directly to [hidden email]"




Regarding CfsSubsetEval feature selection:
As I understand it from Mark Hall's paper, CfsSubsetEval needs to discretize the input data. However, Weka handles the discretization automatically, so I can feed ordinary numeric data to CfsSubsetEval in the Weka GUI, provided all features except the class are continuous numeric variables. Is this correct? For example, I want to perform feature selection on the iris dataset: 4 features are continuous and the class (the 3 flower species) is nominal.
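For my own understanding, I wrote out the merit heuristic from Hall's thesis as a small sketch. This is my own arithmetic on made-up correlation values, not Weka source code; it also shows why a redundant feature can lower a subset's score:

```java
// Hedged sketch (illustrative numbers, not Weka code): the heuristic
// CfsSubsetEval optimizes, from Mark Hall's thesis:
//   Merit_S = k * r_cf / sqrt(k + k*(k-1) * r_ff)
// where k is the subset size, r_cf the mean feature-class correlation,
// and r_ff the mean feature-feature correlation.
public class CfsMerit {
    static double merit(int k, double rcf, double rff) {
        return (k * rcf) / Math.sqrt(k + k * (k - 1) * rff);
    }

    public static void main(String[] args) {
        // One feature strongly correlated with the class:
        System.out.println(merit(1, 0.9, 0.0));  // 0.9
        // Adding a second, highly redundant feature lowers the merit,
        // so CFS prefers the smaller subset:
        System.out.println(merit(2, 0.7, 0.8));  // ~0.74
    }
}
```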

Such a dataset can be fed directly to the Weka GUI, e.g. Weka Explorer -> Open file -> Select attributes -> CfsSubsetEval (the default; leave everything at its default) -> Use full training set -> (Nom) Class -> Start.

Is the above process for feature selection on iris correct? My question is: do I need to do any preprocessing on an iris-style dataset before using CfsSubsetEval?

I am asking because I heard someone say that one needs to use Weka's Discretize filter to discretize the input data BEFORE feeding it to CfsSubsetEval. My understanding, however, is that Weka automatically converts the input data into the proper format for CfsSubsetEval. I base this conclusion on the following three pieces of evidence.

(1) I get no error message when using CfsSubsetEval on datasets consisting of all-continuous features with one nominal class.

(2) The CfsSubsetEval source code, as quoted in another user's post below, clearly shows that the training data (the features of the input dataset) is automatically discretized if the class is not numeric (e.g. the iris flower species). I also dug into the source code myself and came to the same conclusion: Weka automatically discretizes the input data for iris-style datasets, so I don't have to do the discretization myself for CfsSubsetEval to run correctly. Is this correct?
*****************************************************
https://stackoverflow.com/questions/26695120/in-weka-how-can-i-stop-cfssubseteval-from-discretizing-training-instances

I am trying to write a Java program that calls the CfsSubsetEval class in Weka to perform feature subset selection. CfsSubsetEval discretizes the dataset, and I am trying to avoid that because the dataset is already discretized. The following lines from CfsSubsetEval.java perform the discretization:

m_isNumeric = m_trainInstances.attribute(m_classIndex).isNumeric();

// If the class attribute is nominal, all numeric features are discretized
if (!m_isNumeric) {
    m_disTransform = new Discretize();
    m_disTransform.setUseBetterEncoding(true);
    m_disTransform.setInputFormat(m_trainInstances);
    m_trainInstances = Filter.useFilter(m_trainInstances, m_disTransform);
}

Since the class attribute is defined in the ARFF file as follows:

@ATTRIBUTE class {true,false}

the attribute is not numeric, and hence the discretization is performed.
 


(3) I made up a simple dataset with one obviously important variable: three continuous features and one nominal class. CfsSubsetEval has no problem finding the important feature even though I feed continuous variables to it.

The made up dataset is as follows:
**************************************************
@relation 'weka dummy input for variable importance example'

@attribute V1 numeric
@attribute V2 numeric
@attribute V3 numeric
@attribute Category {A,B}

@data
0.15,123,7777,A
0.56,123,7777,A
0.74,123,7776,A
0.68,123,7777,A
0.57,123,7777,A
4,123,7777,B
6,123,7777,B
8,123,7777,B
2,121,7777,B
1,122,7777,B
**************************************************
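To check by hand why V1 dominates this toy dataset, I also computed the information gain of a single threshold split myself. This is my own sketch, not Weka's code; the threshold values (1.0 for V1, 123 for V2) are my own choices based on the data above:

```java
// Hedged sketch (my own code, not Weka internals): information gain of a
// binary threshold split on a two-class dataset, by class counts.
public class ToyInfoGain {
    // Entropy in bits of a node holding a instances of class A, b of class B
    static double entropy(int a, int b) {
        double n = a + b, h = 0;
        for (int c : new int[]{a, b})
            if (c > 0) {
                double p = c / n;
                h -= p * (Math.log(p) / Math.log(2));
            }
        return h;
    }

    // Gain of splitting (aTotal A, bTotal B) into a left child (aL, bL)
    // and a right child holding the remainder
    static double gain(int aTotal, int bTotal, int aL, int bL) {
        int aR = aTotal - aL, bR = bTotal - bL;
        double n = aTotal + bTotal, nL = aL + bL, nR = aR + bR;
        return entropy(aTotal, bTotal)
             - (nL / n) * entropy(aL, bL)
             - (nR / n) * entropy(aR, bR);
    }

    public static void main(String[] args) {
        // V1 < 1.0 separates the classes perfectly: left = 5 A / 0 B
        System.out.println(gain(5, 5, 5, 0));  // 1.0 bit
        // V2 < 123 captures only 2 of the B instances: left = 0 A / 2 B
        System.out.println(gain(5, 5, 0, 2));  // ~0.24 bits
    }
}
```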


Obviously V1 is the most important feature of this dataset; all features are continuous and there is one nominal class. Below is the CfsSubsetEval selection result:



**************************************************
=== Run information ===

Evaluator:    weka.attributeSelection.CfsSubsetEval -P 1 -E 1
Search:       weka.attributeSelection.BestFirst -D 1 -N 5
Relation:     weka dummy input for variable importance example
Instances:    10
Attributes:   4
              V1
              V2
              V3
              Category
Evaluation mode:    evaluate on all training data



=== Attribute Selection on all input data ===

Search Method:
Best first.
Start set: no attributes
Search direction: forward
Stale search after 5 node expansions
Total number of subsets evaluated: 6
Merit of best subset found:    1    

Attribute Subset Evaluator (supervised, Class (nominal): 4 Category):
CFS Subset Evaluator
Including locally predictive attributes

Selected attributes: 1 : 1
                     V1
**************************************************

As can be seen, CfsSubsetEval has no problem identifying the important variable V1. I also tried swapping V1's position (e.g. making V1 become V3), and the result is the same: CfsSubsetEval still finds the correct variable. So I can feed unmodified, unscaled (e.g. not converted to z-scores) raw iris-style datasets to CfsSubsetEval. Is this correct?

Very sorry for my silly questions; I just want to make sure I have understood everything correctly. Thanks in advance for a quick answer!


Another small question concerns Weka's random forest variable importance (the average impurity decrease). The variable with the largest value is the most important one, right? I tried this out using the same simple made-up dataset as above, where V1 is obviously the most important variable; Weka reports the variable importances as follows. So the higher a feature's average impurity decrease, the more important it is, correct?
      0.92 (   100)  V1
      0.19 (    12)  V3
      0.19 (    21)  V2
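My understanding of how that ranking works, sketched with invented per-node numbers (not Weka's actual bookkeeping): the impurity decreases achieved at every tree node that splits on an attribute are averaged over the whole ensemble, and a larger average means a more useful attribute.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hedged sketch (made-up numbers, not Weka internals) of the idea behind
// RandomForest's -attribute-importance ranking: average the impurity
// decrease over all tree nodes that split on each attribute.
public class AvgImpurityDecrease {
    static double avg(double[] decreases) {
        return Arrays.stream(decreases).average().orElse(0.0);
    }

    public static void main(String[] args) {
        // Per-node impurity decreases pooled over all trees (invented)
        Map<String, double[]> perNodeDecreases = new LinkedHashMap<>();
        perNodeDecreases.put("V1", new double[]{1.0, 0.9, 0.85}); // near-pure splits
        perNodeDecreases.put("V2", new double[]{0.20, 0.15});
        perNodeDecreases.put("V3", new double[]{0.18, 0.20});
        perNodeDecreases.forEach((attr, d) ->
            System.out.printf("%.2f (%d nodes)  %s%n", avg(d), d.length, attr));
        // V1 gets the largest average decrease, so it ranks first.
    }
}
```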

***************
   Category   V1  V2   V3
1         A  0.15 123 7777
2         A  0.56 123 7777
3         A  0.74 123 7776
4         A  0.68 123 7777
5         A  0.57 123 7777
6         B  4.00 123 7777
7         B  6.00 123 7777
8         B  8.00 123 7777
9         B  2.00 121 7777
10        B  1.00 122 7777


=== Run information === (V1, V2, V3 in their original arrangement)
Scheme:       weka.classifiers.trees.RandomForest -P 100 -attribute-importance -I 100 -num-slots 1 -K 0 -M 1.0 -V 0.001 -S 1
Relation:     weka dummy input for variable importance example
Instances:    10
Attributes:   4
              V1
              V2
              V3
              Category
Test mode:    evaluate on training data
=== Classifier model (full training set) ===
RandomForest
Bagging with 100 iterations and base learner
weka.classifiers.trees.RandomTree -K 0 -M 1.0 -V 0.001 -S 1 -do-not-check-capabilities
Attribute importance based on average impurity decrease (and number of nodes using that attribute)
      0.92 (   100)  V1
      0.19 (    12)  V3
      0.19 (    21)  V2
Time taken to build model: 0.08 seconds
=== Evaluation on training set ===
Time taken to test model on training data: 0.03 seconds
=== Summary ===
Correctly Classified Instances          10              100      %
Incorrectly Classified Instances         0                0      %
Kappa statistic                          1    
Mean absolute error                      0.05  
Root mean squared error                  0.1138
Relative absolute error                 10      %
Root relative squared error             22.7508 %
Total Number of Instances               10    
=== Detailed Accuracy By Class ===
                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 1.000    0.000    1.000      1.000    1.000      1.000    1.000     1.000     A
                 1.000    0.000    1.000      1.000    1.000      1.000    1.000     1.000     B
Weighted Avg.    1.000    0.000    1.000      1.000    1.000      1.000    1.000     1.000    
=== Confusion Matrix ===
 a b   <-- classified as
 5 0 | a = A
 0 5 | b = B



_______________________________________________
Wekalist mailing list -- [hidden email]
Send posts to [hidden email]
To unsubscribe send an email to [hidden email]
To subscribe, unsubscribe, etc., visit https://list.waikato.ac.nz/postorius/lists/wekalist.list.waikato.ac.nz
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: student question to CfsSubsetEval and Random forest

Eibe Frank-2
Administrator
Yes, that's all correct. Note that you should use the AttributeSelectedClassifier for supervised attribute selection with CFS, to avoid tuning the attribute set to the test data and getting optimistic performance estimates. (Given your other message, I assume that's what you are doing now, so it should be all good.)

Cheers,
Eibe

On Sat, Mar 7, 2020 at 2:17 PM Hao Li <[hidden email]> wrote:

[quoted message trimmed]
