Different subsets of features with different seeds


Different subsets of features with different seeds

neha.bologna
Hi

I would like to know why, when we select different seeds (e.g., 1 to 5) and then run feature selection, we get different subsets of features for different seeds. However, for some datasets, the same subset of features is selected regardless of the seed used. Do you know why this happens? Why does the seed matter for some data but not for others?

Thank you
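
For concreteness, the effect is easy to reproduce with the standard Weka API. Below is a minimal sketch, assuming a hypothetical mydata.arff and CFS with BestFirst search as the selector (the thread never names the actual algorithms): each seed shuffles and subsamples the rows before selection, the way a seeded cross-validation split would.

import java.util.Arrays;
import java.util.Random;

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SeedSensitivity {
    public static void main(String[] args) throws Exception {
        // Hypothetical file name -- substitute your own dataset.
        Instances data = DataSource.read("mydata.arff");
        data.setClassIndex(data.numAttributes() - 1);

        for (int seed = 1; seed <= 5; seed++) {
            // Shuffle the rows with this seed and keep 80% of them,
            // mimicking the training split of a seeded CV fold.
            Instances shuffled = new Instances(data);
            shuffled.randomize(new Random(seed));
            Instances train = new Instances(shuffled, 0,
                    (int) (shuffled.numInstances() * 0.8));

            AttributeSelection sel = new AttributeSelection();
            sel.setEvaluator(new CfsSubsetEval());
            sel.setSearch(new BestFirst());
            sel.SelectAttributes(train);

            // Indices of the selected attributes (class index appended last).
            System.out.println("seed " + seed + ": "
                    + Arrays.toString(sel.selectedAttributes()));
        }
    }
}

A plausible reading of the dataset-dependent behaviour: when a few features clearly dominate, every subsample selects the same subset and the seed appears irrelevant; when many features carry similar, weaker signal, small changes in the training rows flip which ones win.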


Re: Different subsets of features with different seeds

Peter Reutemann
Are you referring to this unofficial package here?

http://flanagan.ugr.es/weka/

Are you using Weka 3.8.0? The package seems to have been developed for that version, so it is possible that newer versions of Weka are no longer compatible with the package.

Since this is not an officially supported package, you should try contacting the author if the problem persists.

Cheers, Peter


--
Peter Reutemann
Dept. of Computer Science
University of Waikato, NZ
+64 (7) 577-5304
http://www.cms.waikato.ac.nz/~fracpete/
http://www.data-mining.co.nz/

Re: Different subsets of features with different seeds

neha.bologna
Hello Peter,

I am sorry, but I think I was unable to convey my message properly. Anyway, I have another, related question.

My project evaluates the consistency of the features selected by different feature selection (FS) algorithms when small changes are made to the data. When I evaluate consistency on the original data (5-fold CV), the subsets selected by the FS algorithms are about 45% consistent. With the same dataset, when I normalize the data using the Weka filter, the percentage of consistently selected features drops sharply to just 10%, and this holds for almost all the datasets I am using.

Could you please elaborate on why this happens? Why does normalizing the data reduce the consistency of the FS algorithms?

Thanks 
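
The thread never defines how the consistency percentage is computed. One common choice in stability studies is the average pairwise Jaccard similarity between the feature subsets selected on different folds; a minimal sketch (the class and method names are my own):

import java.util.HashSet;
import java.util.Set;

public class SubsetConsistency {

    // Jaccard similarity between two selected-feature index sets:
    // |A intersect B| / |A union B|; 1.0 when identical, 0.0 when disjoint.
    public static double jaccard(int[] a, int[] b) {
        Set<Integer> sa = new HashSet<>();
        for (int i : a) sa.add(i);
        Set<Integer> sb = new HashSet<>();
        for (int i : b) sb.add(i);
        Set<Integer> inter = new HashSet<>(sa);
        inter.retainAll(sb);
        Set<Integer> union = new HashSet<>(sa);
        union.addAll(sb);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }

    // Average pairwise similarity over the subsets from all folds.
    public static double stability(int[][] subsets) {
        double sum = 0.0;
        int pairs = 0;
        for (int i = 0; i < subsets.length; i++) {
            for (int j = i + 1; j < subsets.length; j++) {
                sum += jaccard(subsets[i], subsets[j]);
                pairs++;
            }
        }
        return pairs == 0 ? 1.0 : sum / pairs;
    }
}

Computing this score once on the raw data and once on the Normalize-filtered data would make the 45% vs. 10% comparison reproducible.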



Re: Different subsets of features with different seeds

Peter Reutemann
> I am sorry, but I think I was unable to convey my message properly.

Apologies. I was replying from my phone and must have accidentally
replied to the wrong email (tiny touch screens are a poor substitute
for screen/keyboard).

> Could you please elaborate on why this happens? Why does normalizing
> the data reduce the consistency of the FS algorithms?

Without seeing your data or knowing which algorithms and parameters
you're using, it's hard to say why this is happening (not that I use
attribute selection much myself).
It's possible that normalizing your data produces very small numbers,
which influences how the attribute selection algorithms compute their
internal metrics.

Cheers, Peter
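
For reference, the filter in question is weka.filters.unsupervised.attribute.Normalize, which by default rescales every numeric attribute to [0, 1]. A minimal sketch, assuming a hypothetical mydata.arff:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

public class NormalizeDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical file name -- substitute your own dataset.
        Instances data = DataSource.read("mydata.arff");

        // By default, Normalize rescales each numeric attribute to [0, 1].
        Normalize norm = new Normalize();
        norm.setInputFormat(data);
        Instances normalized = Filter.useFilter(data, norm);

        // An attribute whose raw values span [0, 10000] but mostly sit
        // in [0, 10] now lives almost entirely in [0, 0.001] -- the
        // "very small numbers" referred to above.
        System.out.println(normalized.instance(0));
    }
}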

Re: Different subsets of features with different seeds

neha.bologna
Hi Peter, thanks for your reply.

> It's possible that normalizing your data produces very small numbers,
> which influences how the attribute selection algorithms compute their
> internal metrics.

I did not understand this statement. If we normalize our data, the range of attribute values becomes 0 to 1, right? I don't know whether it is a limitation of the feature selection algorithms I am using that they are so sensitive to small changes in the training data. With 10-fold CV, the percentage of consistent features is different; with 3 folds and 5 folds, it is different again; and when I randomly exclude some instances, it changes yet again.

Best regards



Re: Different subsets of features with different seeds

Peter Reutemann
> With 10-fold CV, the percentage of consistent features is different;
> with 3 folds and 5 folds, it is different again; and when I randomly
> exclude some instances, it changes yet again.

Changing the number of splits or removing instances changes the data
distribution, and most algorithms are sensitive to such changes
(sometimes even to the order of the data).
Without looking at the code of the algorithms, I cannot really
comment. I do know from other algorithms that 1e-6 is sometimes used
as a threshold, so very small numbers could have an impact.
Inspect the code of the algorithms to see what's happening inside;
that should help you answer your question.

Cheers, Peter
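
A toy illustration of the threshold point, with made-up numbers and the 1e-6 value mentioned above: rescaling by a large attribute range can push a genuine difference between two scores below the comparison threshold, turning a clear winner into a tie.

public class ThresholdDemo {

    // Some algorithms treat differences smaller than this as zero.
    static final double EPSILON = 1e-6;

    public static void main(String[] args) {
        // Two raw scores that differ before rescaling...
        double rawA = 1200.0;
        double rawB = 1200.5;

        // ...divided by a large range (max - min = 1e9), as min-max
        // normalization effectively does.
        double range = 1e9;
        double diff = Math.abs(rawA - rawB) / range; // 5e-10

        System.out.println("scaled difference = " + diff);
        System.out.println("treated as a tie? " + (diff < EPSILON)); // true
    }
}

Once two features tie, which one is selected comes down to tie-breaking, and that is where fold splits, instance removal, and ordering start to matter.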

Re: Different subsets of features with different seeds

neha.bologna

Thank you, Peter, for the useful information.

> Changing the number of splits or removing instances changes the data
> distribution, and most algorithms are sensitive to such changes
> (sometimes even to the order of the data).

Yes, the feature selection algorithms I employ are very sensitive to the number of splits and to excluding instances. However, when I just changed the order of the features in a dataset (x1, x2, x3 to x3, x2, x1), I did not see any significant change.

Best regards

Re: Different subsets of features with different seeds

Peter Reutemann
> Yes, the feature selection algorithms I employ are very sensitive to
> the number of splits and to excluding instances. However, when I just
> changed the order of the features in a dataset (x1, x2, x3 to
> x3, x2, x1), I did not see any significant change.

I was referring to the order of the rows, not the columns. Column
ordering usually only affects tie-breaking, e.g., which of several
equally scored attributes gets picked first.

Cheers, Peter
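
Both kinds of reordering can be tried directly in Weka. A minimal sketch, assuming a hypothetical mydata.arff with three attributes (as in the x1, x2, x3 example): the seeded shuffle changes which rows land in which fold, while the Reorder filter's column permutation leaves the instances themselves untouched.

import java.util.Random;

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Reorder;

public class OrderDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical file name -- substitute your own dataset.
        Instances data = DataSource.read("mydata.arff");

        // Row order: a seeded shuffle changes which instances land in
        // which CV fold -- the sensitivity described above.
        Instances rowShuffled = new Instances(data);
        rowShuffled.randomize(new Random(42));

        // Column order: x1, x2, x3 -> x3, x2, x1. The instances are
        // untouched, so this should only matter on exact score ties.
        Reorder reorder = new Reorder();
        reorder.setAttributeIndices("3,2,1"); // indices for 3 attributes
        reorder.setInputFormat(data);
        Instances colReversed = Filter.useFilter(data, reorder);

        System.out.println(colReversed.instance(0));
    }
}

If the selected subset changes under the row shuffle but not under the column reversal, that matches the tie-breaking explanation above.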