Issues with data and algorithm

Issues with data and algorithm

andria lan
Hi all, 

I have the following issue:

1- I applied the Logistic regression algorithm to my textual data.
2- I then added either "tt" or "gg" to each instance.
3- Comparing steps 1 and 2, I observed that Logistic performed better after step 2. In addition, every time I add more letters to each instance, the results improve further (the more letters I add, the better Logistic performs).
 
My questions:

1- Why does Logistic improve in this scenario?

2- Does adding letters such as "tt" or "gg" to each instance count as adding extra weight (or perhaps extra features) to the instances, thereby changing the data characteristics in a way that affects the performance of Logistic? Or

3- Can we say that Logistic is not able to weight the original features properly by itself, and that manually adding extra letters gives extra weight to specific instances, which in turn helps Logistic perform better?

Any help would be highly appreciated. 

Thanks.
Andria



_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
List info and subscription status: https://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: Issues with data and algorithm

Eibe Frank-2
Administrator
So you add “tt” for instances belonging to one class, and “gg” for instances belonging to the other? And then you use StringToWordVector with the CharacterNGramTokenizer? If that’s the case, your result makes sense because the filtered data will contain two perfectly discriminating attributes. Rather than adding additional letters, you can probably also decrease the ridge value to allow Logistic to give a higher weight to these perfect discriminators.
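
To make the point about perfectly discriminating attributes concrete, here is a minimal sketch in plain Python rather than Weka. The toy corpus, the `char_ngrams` helper with its `n_min`/`n_max` parameters, and the `perfect_discriminators` function are all assumptions made up for illustration; they only mimic the idea of character n-gram tokenization and are not the actual StringToWordVector output.

```python
from collections import Counter, defaultdict

def char_ngrams(text, n_min=1, n_max=3):
    """Return the set of character n-grams of length n_min..n_max in text."""
    return {text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)}

def perfect_discriminators(corpus):
    """Map token -> label for tokens found in every document of one class
    and in no document of any other class."""
    docs_with_token = defaultdict(Counter)  # token -> label -> document count
    class_size = Counter()
    for text, label in corpus:
        class_size[label] += 1
        for token in char_ngrams(text):
            docs_with_token[token][label] += 1
    found = {}
    for token, per_label in docs_with_token.items():
        if len(per_label) == 1:
            label, count = next(iter(per_label.items()))
            if count == class_size[label]:
                found[token] = label
    return found

# Hypothetical toy corpus: the same words occur in both classes.
docs = [
    ("nice film", "pos"),
    ("bad plan", "pos"),
    ("nice plan", "neg"),
    ("bad film", "neg"),
]

# Hypothetical class-specific markers, appended as in step 2 of the question.
markers = {"pos": "tt", "neg": "gg"}
marked_docs = [(f"{text} {markers[label]}", label) for text, label in docs]

before = perfect_discriminators(docs)
after = perfect_discriminators(marked_docs)

print("before:", before)
print("after: ", after)
```

In this toy setup no token separates the classes before the markers are added, while afterwards every n-gram derived from the marker does. A classifier that can put weight on such tokens is effectively reading the class label from the input.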

Cheers,
Eibe

> On 14/06/2017, at 5:00 PM, Andria Lan <[hidden email]> wrote:
>
> Hi all,
>
> I have the following issue:
>
> 1- I applied the Logistic regression algorithm on my textual data, then
> 2- to each instance, I added either "tt" or "gg".
> 3- After comparing step 1 and 2, I observed that Logistic performed better with step 2. In addition, I observed that every time I add additional letters to each instance, the Logistic gets improved  (the more I add a letter, the more the Logistic's results get improved).
>  
> My questions:
>
> 1- Why such improvement occurred to the Logistic method in my scenario?
>
> 2- Is adding the letters to each instance, i.e., adding the "tt" or "gg", is considered adding extra weight (or perhaps adding extra features) to the instances and though changing data characteristics in a way that affects the performance of the Logistic method? Or
>
> 3- Can we say that the Logistic method is not able to weight the original features itself properly, while when we manually add extra letters we provide extra weights to the specific instance, and this, in turn, alerts the Logistic and performs better later on?  
>
> Any help would be highly appreciated.
>
> Thanks.
> Andria
>

Re: Issues with data and algorithm

andria lan
Hi Eibe, 

Thank you very much for the prompt reply. Please find my detailed reply below.

> So you add “tt” for instances belonging to one class, and “gg” for instances belonging to the other? And then you use StringToWordVector with the CharacterNGramTokenizer?

Yes, I did that, but with TF/IDF. I have one question here: why did you assume the use of CharacterNGramTokenizer?

> If that’s the case, your result makes sense because the filtered data will contain two perfectly discriminating attributes. Rather than adding additional letters, you can probably also decrease the ridge value to allow Logistic to give a higher weight to these perfect discriminators.

So do you think I should go back to the original data and only decrease the ridge value (without adding letters like "tt" or "gg")?


Thanks.
Andria


 


Re: Issues with data and algorithm

Eibe Frank-2
Administrator

> On 14/06/2017, at 5:24 PM, Andria Lan <[hidden email]> wrote:
>
> So you add “tt” for instances belonging to one class, and “gg” for instances belonging to the other? And then you use StringToWordVector with the CharacterNGramTokenizer?
>
> Yes, I did that, but with TF/IDF. I have one question here: why did you assume the use of CharacterNGramTokenizer?

Adding more letters to a “word” that is already unique should not change the accuracy obtained. With the CharacterNGramTokenizer, however, you could get tokens such as “gg” and “ggg” from an input string that contains “ggg”.
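
As a rough illustration of why the tokenizer matters, the sketch below compares whitespace word splitting with character n-grams in plain Python. The function name `char_ngram_counts` and the sizes `n_min=2`, `n_max=3` are assumptions chosen for the example, not the settings used in the thread.

```python
from collections import Counter

def char_ngram_counts(text, n_min=2, n_max=3):
    """Count every character n-gram of length n_min..n_max in text."""
    return Counter(text[i:i + n]
                   for n in range(n_min, n_max + 1)
                   for i in range(len(text) - n + 1))

# A word tokenizer sees one token no matter how many letters are added.
word_tokens = "ggg".split()

# A character n-gram tokenizer sees more tokens, and higher counts,
# as the marker gets longer.
for marker in ("gg", "ggg", "gggg"):
    print(marker, dict(char_ngram_counts(marker)))
```

With word tokens, "gg" and "gggg" are each a single unique attribute, so the extra letters add nothing. With character n-grams, a longer marker yields more marker-derived tokens and larger counts for each, which is consistent with results that keep improving as letters are added.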

> If that’s the case, your result makes sense because the filtered data will contain two perfectly discriminating attributes. Rather than adding additional letters, you can probably also decrease the ridge value to allow Logistic to give a higher weight to these perfect discriminators.
>
> So do you think I should go back to the original data and only decrease the ridge value (without adding letters like "tt" or "gg")?

Not necessarily. Normally, decreasing the ridge will just increase overfitting. It will only help if some existing words in your data are highly discriminative, which is probably not the case.
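
The effect of the ridge value can be sketched with a tiny ridge-penalized logistic regression in plain Python. Everything here is an assumption for illustration: the single binary feature standing in for a perfect discriminator, the penalty form `ridge * w**2`, the two ridge values, and the use of plain gradient descent, which is not meant to reproduce the optimizer inside Weka's Logistic.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, ridge, steps=5000, lr=0.1):
    """Gradient descent on: negative log-likelihood + ridge * w**2.
    The bias b is left unpenalized."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = 2.0 * ridge * w
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical data: the feature is 1 exactly when the class is 1,
# i.e. it is a perfect discriminator.
xs = [1, 1, 1, 0, 0, 0]
ys = [1, 1, 1, 0, 0, 0]

w_strong, b_strong = fit_logistic(xs, ys, ridge=1.0)   # heavy penalty
w_weak, b_weak = fit_logistic(xs, ys, ridge=1e-8)      # almost no penalty

print(f"ridge=1.0   -> w={w_strong:.2f}")
print(f"ridge=1e-8  -> w={w_weak:.2f}")
```

A smaller ridge lets the weight on the discriminating feature grow much larger. That helps when the feature really is a clean signal, as with the added markers, but on ordinary words the same freedom mostly lets the model fit noise in the training data.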

Cheers,
Eibe


Re: Issues with data and algorithm

andria lan
Many thanks for your help, Eibe; it solved my problem.

God bless you.

Andria
