Help with attributeNamePrefix

Help with attributeNamePrefix

valerio jus
Hi all, 

I could not figure out the purpose of the "attributeNamePrefix" option (Prefix for the created attribute names) of "StringToWordVector". How is this option used? An example would help.

Appreciate any advice

_______________________________________________
Wekalist mailing list -- [hidden email]
Send posts to [hidden email]
To unsubscribe send an email to [hidden email]
To subscribe, unsubscribe, etc., visit https://list.waikato.ac.nz/postorius/lists/wekalist.list.waikato.ac.nz
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: Help with attributeNamePrefix

Peter Reutemann
> I could not figure out the purpose of the "attributeNamePrefix" option (Prefix for the created attribute names) of "StringToWordVector". How is this option used? An example would help.

StringToWordVector generates a lot of new attributes, based on the
words (actually tokens) it finds in the string attribute(s). If your
data already contains other attributes, then it is possible that newly
created attributes will clash with them, as Weka does not allow
duplicate attribute names in datasets. "attributeNamePrefix" can be
used to disambiguate the newly generated attributes by prefixing them
with the supplied string.
For example, you could use "w-" as the prefix. Assuming that the text
contains the two words "hello world", instead of generating two
attributes called "hello" and "world" it will generate "w-hello" and
"w-world".

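As a rough illustration of the name clash and how a prefix avoids it, here is a plain Python sketch. It is not Weka code: the `word_attribute_names` helper, the whitespace tokenisation and the example attribute names are all assumptions made up for this example.

```python
def word_attribute_names(text, existing, prefix=""):
    """Return the attribute names that would be created for the tokens in text.

    Hypothetical helper: it only mimics the naming step, not the Weka filter.
    Raises ValueError if a new name collides with an existing attribute name.
    """
    new_names = []
    for token in text.split():
        name = prefix + token
        if name in existing:
            raise ValueError(f"duplicate attribute name: {name}")
        if name not in new_names:
            new_names.append(name)
    return new_names


# Assumed dataset: it already contains an attribute called "hello".
existing = ["id", "hello"]

try:
    word_attribute_names("hello world", existing)
    clash = False
except ValueError:
    clash = True

prefixed = word_attribute_names("hello world", existing, prefix="w-")
print(clash)     # True
print(prefixed)  # ['w-hello', 'w-world']
```

Without a prefix the token "hello" collides with the existing attribute; with "w-" both generated names are unique.
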
Cheers, Peter
--
Peter Reutemann
Dept. of Computer Science
University of Waikato, NZ
+64 (7) 858-5174
http://www.cms.waikato.ac.nz/~fracpete/
http://www.data-mining.co.nz/

Re: Help with attributeNamePrefix

valerio jus
Hi Peter,

Thank you so much for the helpful reply. 

Take care and stay safe. 

Cheers, 
Valerio 

On Fri, 1 May 2020, 7:53 am Peter Reutemann, <[hidden email]> wrote:
> [...]

Re: Help with attributeNamePrefix

Edward Wiskers
Hi Peter,

In line with Valerio's question: after reading about "periodicPruning", a parameter of the StringToWordVector filter, its purpose is still not clear to me, nor is how it differs from -W. Could you please clarify the main goal of this parameter and give an example of what it does?

Thanks in advance.

Edward 

On Fri, 1 May 2020, 8:50 pm Valerio jus, <[hidden email]> wrote:
> [...]

Re: Help with attributeNamePrefix

Peter Reutemann
> In line with Valerio's question: after reading about "periodicPruning", a parameter of the StringToWordVector filter, its purpose is still not clear to me, nor is how it differs from -W. Could you please clarify the main goal of this parameter and give an example of what it does?

The dictionary builder used under the hood is a streaming algorithm,
i.e., it processes rows one at a time.
If the pruning rate is > 0 then every X instances (calculated from
that rate), the dictionary will get pruned.
Default rate is -1, i.e., off.
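
To make the idea concrete, here is a minimal Python sketch of a streaming dictionary with periodic pruning. It is not the Weka implementation: the `build_dictionary` function, the whitespace tokenisation and the rule "drop tokens seen fewer than `min_count` times so far" are assumptions chosen for illustration.

```python
from collections import Counter


def build_dictionary(rows, prune_every=-1, min_count=2):
    """Count tokens one row at a time, pruning every `prune_every` rows.

    Assumption for this sketch: pruning drops tokens whose count so far is
    below `min_count`. A value of -1 (or 0) for `prune_every` disables it.
    Returns the final counts and the largest dictionary size reached.
    """
    counts = Counter()
    peak = 0
    for seen, row in enumerate(rows, start=1):
        counts.update(row.split())
        peak = max(peak, len(counts))
        if prune_every > 0 and seen % prune_every == 0:
            for token in [t for t, c in counts.items() if c < min_count]:
                del counts[token]
    return counts, peak


rows = ["a b", "a c", "a d", "a e", "a f", "a g"]
full, peak_off = build_dictionary(rows)                  # pruning off
pruned, peak_on = build_dictionary(rows, prune_every=2)  # prune every 2 rows
print(len(full), peak_off)    # 7 7
print(dict(pruned), peak_on)  # {'a': 6} 3
```

With pruning switched on, the dictionary never grows beyond 3 entries in this toy input, whereas without it all 7 tokens are held until the end.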

Cheers, Peter

Re: Help with attributeNamePrefix

Edward Wiskers

> The dictionary builder used under the hood is a streaming algorithm,
> i.e., it processes rows one at a time.
> If the pruning rate is > 0 then every X instances (calculated from
> that rate), the dictionary will get pruned.
> Default rate is -1, i.e., off.

Thank you Peter.

1- This approach is mainly designed to reduce the size of the dictionary when it is activated. Correct?

2- By default it is not activated; instead, the -W parameter does the pruning job. Right?

Thank you.

Re: Help with attributeNamePrefix

Peter Reutemann
> 1- This approach is mainly designed to reduce the size of the dictionary when it is activated. Correct?
>
> 2- By default it is not activated; instead, the -W parameter does the pruning job. Right?

The dictionary is *always* created - otherwise you cannot determine
which words to use.

Copy/paste from the StringToWordVector Javadoc:

 -prune-rate <rate as a percentage of dataset>
  Specify the rate (e.g., every 10% of the input dataset) at which to
  periodically prune the dictionary.
  -W prunes after creating a full dictionary. You may not have enough
  memory for this approach.
  (default: no periodic pruning)

https://weka.sourceforge.io/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html
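
The two ideas in that Javadoc text can be sketched in plain Python. This is not Weka code: `prune_interval` assumes the percentage is simply converted into a number of rows, and `keep_top_words` assumes that -W style pruning keeps roughly the most frequent words once the full dictionary exists. Both function names are made up for this example.

```python
from collections import Counter


def prune_interval(rate_percent, n_rows):
    """Rows between prunes for a rate given as a percentage of the dataset.

    Assumed conversion for this sketch; the filter may round differently.
    """
    return max(1, round(n_rows * rate_percent / 100))


def keep_top_words(rows, words_to_keep):
    """Build the full dictionary first, then keep the most frequent words.

    Returns the kept words and the size the full dictionary reached,
    which is what has to fit in memory with this approach.
    """
    counts = Counter()
    for row in rows:
        counts.update(row.split())
    full_size = len(counts)
    kept = [word for word, _ in counts.most_common(words_to_keep)]
    return kept, full_size


interval = prune_interval(10, 1000)
kept, full_size = keep_top_words(["spam spam ham", "spam eggs", "ham toast"], 2)
print(interval)         # 100
print(kept, full_size)  # ['spam', 'ham'] 4
```

The point of the contrast: pruning after the fact still requires the complete dictionary (4 entries here) to exist first, while periodic pruning trims it while the rows are being read.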

Cheers, Peter

Re: Help with attributeNamePrefix

Edward Wiskers
Thank you so much Peter. 

Things are clear now. 


Cheers, 
Edward 

On Mon, 4 May 2020, 4:38 am Peter Reutemann, <[hidden email]> wrote:
> [...]

Re: Help with attributeNamePrefix

valerio jus
In reply to this post by Peter Reutemann
Hi Peter, 

I found a strange result in StringToWordVector. When I load data consisting of only one sentence of 4 words and set "attributeIndices" of StringToWordVector to 3,4, I get this error message:

 Problem filtering instances: Invalid range list at 3


How can I solve this issue?

Thanks in advance.
Valerio


Re: Help with attributeNamePrefix

Peter Reutemann
> I found a strange result in StringToWordVector. When I load data consisting of only one sentence of 4 words and set "attributeIndices" of StringToWordVector to 3,4, I get this error message:
>
>  Problem filtering instances: Invalid range list at 3

So your dataset has at least 4 attributes, with STRING attributes at
positions 3 and 4 (using 1-based indices)?

Cheers, Peter

Re: Help with attributeNamePrefix

valerio jus
Hi Peter,

I have only one instance (one sentence) of string type. This sentence has 4 words.

After applying "StringToWordVector" with the "attributeIndices" parameter set to "3,4" ("first-last" is the default value), I got this error message:
Problem filtering instances: Invalid range list at 3.

I am not sure what I am missing.

Any help would be appreciated. 

Valerio 


On Fri, 8 May 2020, 4:36 am Peter Reutemann, <[hidden email]> wrote:
> [...]

Re: Help with attributeNamePrefix

Peter Reutemann
> I have only one instance (one sentence) of string type. This sentence has 4 words.
>
> After applying "StringToWordVector" with the "attributeIndices" parameter set to "3,4" ("first-last" is the default value), I got this error message:
> Problem filtering instances: Invalid range list at 3.
>
> I am not sure what I am missing.

How many attributes (columns, not rows!) does your input dataset have?
You're telling the filter to process STRING columns 3 and 4 in your
dataset. If you don't have at least 4 columns, this will fail.
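
A small Python sketch of that check, just to show why "3,4" cannot work on a dataset with fewer than 4 columns. It is not Weka code: `check_indices` is a made-up helper, and the error text merely imitates the message from the report above.

```python
def check_indices(indices, num_attributes):
    """Validate 1-based attribute indices against the number of columns.

    Hypothetical helper: returns the 0-based positions, or raises ValueError
    for the first index that does not exist in the dataset.
    """
    for index in indices:
        if not 1 <= index <= num_attributes:
            raise ValueError(f"Invalid range list at {index}")
    return [index - 1 for index in indices]


# Assumed case: one string attribute plus a class attribute, i.e. 2 columns.
try:
    check_indices([3, 4], num_attributes=2)
    message = None
except ValueError as error:
    message = str(error)

positions = check_indices([3, 4], num_attributes=4)
print(message)    # Invalid range list at 3
print(positions)  # [2, 3]
```

The number of words in the sentence plays no role here; only the number of columns in the input dataset matters.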

Cheers, Peter

Re: Help with attributeNamePrefix

valerio jus
Thank you so much, Peter. 

I have data that consists of 7 features (excluding the class). I now need to select two ranges, like this: (1-3)&(4-6), but Weka produced this error:

Problem filtering instances: Invalid range list at (1-3)&(4-6)

How can I fix this error? How do I specify a range made of two sets?

Thanks again.
Valerio

On Fri, May 8, 2020 at 5:36 AM Peter Reutemann <[hidden email]> wrote:
> [...]

Re: Help with attributeNamePrefix

Peter Reutemann
> I have data that consists of 7 features (excluding the class). I now need to select two ranges, like this: (1-3)&(4-6), but Weka produced this error:
>
> Problem filtering instances: Invalid range list at (1-3)&(4-6)

Separate the ranges with a comma: 1-3,4-6
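
For illustration, a simplified Python sketch of how such a comma-separated range list can be expanded. It is not the Weka parser: `parse_range_list` is a made-up function that only handles plain indices, "a-b" ranges and the keywords "first" and "last", and its error text only imitates the message above.

```python
def parse_range_list(spec, num_attributes):
    """Expand a spec such as "1-3,4-6" or "first-last" into 1-based indices.

    Simplified, hypothetical parser; raises ValueError on anything it
    cannot read or on indices outside 1..num_attributes.
    """
    def to_index(token):
        token = token.strip()
        if token == "first":
            return 1
        if token == "last":
            return num_attributes
        return int(token)  # raises ValueError for text such as "(1"

    indices = []
    for part in spec.split(","):
        start, dash, end = part.partition("-")
        low = to_index(start)
        high = to_index(end) if dash else low
        if not 1 <= low <= high <= num_attributes:
            raise ValueError(f"Invalid range list at {part}")
        indices.extend(range(low, high + 1))
    return indices


# 7 features plus the class attribute gives 8 columns.
selected = parse_range_list("1-3,4-6", 8)
print(selected)  # [1, 2, 3, 4, 5, 6]

try:
    parse_range_list("(1-3)&(4-6)", 8)
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```

Parentheses and "&" are not part of the syntax, so the whole string is rejected, while the comma form expands to the six wanted indices.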

Cheers, Peter

Re: Help with attributeNamePrefix

valerio jus
In reply to this post by valerio jus
Wonderful. Thank you so much Peter. Solved perfectly.

Appreciate your help.

Have a nice day and stay safe. 

Valerio

On Fri, May 8, 2020 at 5:20 AM Valerio jus <[hidden email]> wrote:
> [...]