Dealing with textual data

Dealing with textual data

valerio jus
Hi all,

In general, when dealing with textual data, does converting the words to lowercase come before tokenization?

Thanks in advance.
Valerio

_______________________________________________
Wekalist mailing list -- [hidden email]
Send posts to: To unsubscribe send an email to [hidden email]
To subscribe, unsubscribe, etc., visit
https://list.waikato.ac.nz/postorius/lists/wekalist.list.waikato.ac.nz
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: Dealing with textual data

Eibe Frank-2
Administrator
Here is the key bit of code from the DictionaryBuilder class:

if (m_selectedRange.isInRange(j) && !inst.isMissing(j)) {
  m_tokenizer.tokenize(inst.stringValue(j));

  while (m_tokenizer.hasMoreElements()) {
    String word = m_tokenizer.nextElement();

    if (m_lowerCaseTokens) {
      word = word.toLowerCase();
    }
    word = m_stemmer.stem(word);
    if (m_stopwordsHandler.isStopword(word)) {
      continue;
    }

    int[] counts = m_inputVector.get(word);
    if (counts == null) {
      counts = new int[2];
      counts[0] = 1; // word count
      counts[1] = 1; // doc count
      m_inputVector.put(word, counts);
    } else {
      counts[0]++;
    }
  }
}

So, the tokens that are returned by the tokeniser are turned to lowercase (if the user has chosen to do so) and stemmed before they are compared to the list of stop words (and skipped if they match).
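
As a rough illustration of that order (tokenize, then lowercase, then stem, then check stop words, then count), here is a minimal self-contained sketch. It is not Weka code: the class name, the whitespace tokenizer, the toy stemmer, and the stop list are all assumptions made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative only: mirrors the order of steps in the quoted loop,
// but none of these helpers are Weka classes.
class PipelineOrderSketch {

  // Hypothetical stop list for the example.
  static final Set<String> STOPWORDS = Set.of("the", "and");

  // Toy stemmer (assumption): strips one trailing "s" from longer words.
  static String stem(String word) {
    return (word.length() > 3 && word.endsWith("s"))
        ? word.substring(0, word.length() - 1) : word;
  }

  static Map<String, Integer> countTokens(String text, boolean lowerCaseTokens) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    // 1. Tokenize first (here: split on whitespace).
    for (String word : text.trim().split("\\s+")) {
      if (word.isEmpty()) continue;
      // 2. Lowercase each token, if requested.
      if (lowerCaseTokens) word = word.toLowerCase(Locale.ROOT);
      // 3. Stem the token.
      word = stem(word);
      // 4. The stop-word check sees the lowercased, stemmed form.
      if (STOPWORDS.contains(word)) continue;
      // 5. Count whatever survives.
      counts.merge(word, 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    System.out.println(countTokens("The cats and the cat", true));
    System.out.println(countTokens("The cats and the cat", false));
  }
}
```

With lowercasing switched off, the capitalised "The" no longer matches the lowercase stop list and is counted as its own word, which is why the position of the lowercase step relative to the stop-word check matters.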

Cheers,
Eibe

> On 25/09/2019, at 2:10 PM, Valerio jus <[hidden email]> wrote:
>
> Hi all,
>
> In general, when dealing with textual data, does converting the words to lowercase come before tokenization?
>
> Thanks in advance.
> Valerio

Re: Dealing with textual data

valerio jus
Eibe, thank you for the prompt reply. 
In line with your explanation, if the user chooses to normalize the training set (document length), this should happen after the tokens are lowercased and before stop words are removed (assuming the user wants stop-word removal, stemming, and document-length normalization). Is that correct?

Thanks. 
Valerio

Re: Dealing with textual data

Eibe Frank-2
Administrator
No, vector normalisation in StringToWordVector is performed once all the attribute values have been computed. The attribute values are computed based on the counts established in the code snippet I’ve posted.
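
To make the "normalise last" point concrete, here is a minimal sketch that scales an already computed vector of attribute values. It is not the StringToWordVector implementation: the class name, the use of Euclidean length, and the targetLength parameter are assumptions chosen for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: normalisation applied to finished attribute values.
class NormaliseLastSketch {

  // Scales the vector so that its Euclidean length equals targetLength.
  // Both the norm and the target are assumptions for this example.
  static Map<String, Double> normalise(Map<String, Double> values, double targetLength) {
    double sumOfSquares = 0.0;
    for (double v : values.values()) {
      sumOfSquares += v * v;
    }
    double length = Math.sqrt(sumOfSquares);

    Map<String, Double> result = new LinkedHashMap<>();
    for (Map.Entry<String, Double> e : values.entrySet()) {
      // Leave an all-zero vector unchanged to avoid dividing by zero.
      double scaled = (length == 0.0) ? e.getValue() : e.getValue() * targetLength / length;
      result.put(e.getKey(), scaled);
    }
    return result;
  }

  public static void main(String[] args) {
    // Values as they might look after counting, i.e. after lowercasing,
    // stemming and stop-word removal have already happened.
    Map<String, Double> values = new LinkedHashMap<>();
    values.put("cat", 3.0);
    values.put("dog", 4.0);
    System.out.println(normalise(values, 1.0));
  }
}
```

Because the scaling only reads the finished values, it cannot affect which tokens were lowercased, stemmed, or dropped as stop words.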

Cheers,
Eibe

> On 25/09/2019, at 3:28 PM, Valerio jus <[hidden email]> wrote:
>
> Eibe, thank you for the prompt reply.
> In line with your explanation, if the user chooses to normalize the training set (document length), this should happen after the tokens are lowercased and before stop words are removed (assuming the user wants stop-word removal, stemming, and document-length normalization). Is that correct?
>
> Thanks.
> Valerio

Re: Dealing with textual data

valerio jus
So vector normalisation is always the last stage, whatever other stages (e.g., stop-word removal, stemming) are performed?

Thank you. 
Valerio

On Wed, Sep 25, 2019 at 11:43 AM Eibe Frank <[hidden email]> wrote:
No, vector normalisation in StringToWordVector is performed once all the attribute values have been computed. The attribute values are computed based on the counts established in the code snippet I’ve posted.

Cheers,
Eibe

Re: Dealing with textual data

Eibe Frank-2
Administrator
Yes.

Cheers,
Eibe

> On 25/09/2019, at 3:46 PM, Valerio jus <[hidden email]> wrote:
>
> So vector normalisation is always the last stage, whatever other stages (e.g., stop-word removal, stemming) are performed?
>
> Thank you.
> Valerio

Re: Dealing with textual data

valerio jus
Thanks.
Valerio

On Wed, Sep 25, 2019 at 11:52 AM Eibe Frank <[hidden email]> wrote:
Yes.

Cheers,
Eibe