Logistic Regression equation - binarization of nominal attributes


Luisa
The probabilities of a logistic regression are given by:

P(1 | a, b) = 1 / (1 + Exp(w0 + w1*a + w2*b))   (Eq1)
where the ws are the weights, and a, b are the attribute values.

I have a database with 3 attributes and a class: attribute a = CATEG (nominal with 3 different values: multipara/primipara/novilha) and attribute b = RACATO (nominal with 2 different values: nelore/angus). The class (DG) is binary: 0/1.

I ran logistic regression in Weka and obtained the following weights:

                     Class
Variable                 1
==========================
CATEG=MULTIPARA    -0.1952
CATEG=PRIMIPARA     0.1182
CATEG=NOVILHA       0.1953
RACATO=NELORE      -0.0599
Intercept           0.4637

I do understand that if I have the following instance
CATEG= MULTIPARA   RACATO = NELORE

then the probability of this instance having class 1 is 
P(DG=1| instance)  = 1/ (1+ Exp(0.4637 +  -0.1952*1 +  -0.0599*1))

However, if I change RACATO to ANGUS, the expression loses the weight -0.0599. This is as if NELORE had the value 1 and ANGUS the value 0, which makes sense for a binary nominal attribute:
P(DG=1| instance)  = 1/ (1+ Exp(0.4637 +  -0.1952*1 + -0.0599*0))
Therefore, as in Eq1, the weight w2 = -0.0599, and attribute b is binary and corresponds to RACATO, being 1 for NELORE and 0 for ANGUS.

However, if I change CATEG from MULTIPARA to PRIMIPARA
CATEG= PRIMIPARA  RACATO = NELORE

Weka uses a different weight. Instead of -0.1952 for CATEG = MULTIPARA, it uses the weight 0.1182 for CATEG = PRIMIPARA, and the probability is given by
P(DG=1| instance)  = 1/ (1+ Exp(0.4637 +  0.1182*1 + -0.0599*1))

Therefore, the weight w1 in Eq1 is not a fixed value. I expected w1 to be a constant, with attribute a taking the value 1 for MULTIPARA, 2 for PRIMIPARA, and 3 for NOVILHA. Instead, the CATEG attribute in Eq1 does not take the values 1, 2, and 3 for MULTIPARA, PRIMIPARA, and NOVILHA respectively.

That is, the true equation is not Eq1. The true equation is given by Eq2:
  P(1| a,b,c) = 1 / (1 + Exp(w0 + w11*a1 + w12*a2+ w13*a3 + w2*b + w3*c))   (Eq2)  
where a1, a2, and a3 are binary variables for CATEG = MULTIPARA, PRIMIPARA, and NOVILHA, respectively.
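
A minimal Python sketch of Eq2, using the weights from the table above. The function names and the dictionary layout are illustrative assumptions, not part of Weka. The sign of the exponent follows the equations as written in this post; the conventional logistic function uses Exp(-z), so it is worth checking which class the Weka output refers to before relying on the numbers.

```python
import math

# Weights copied from the Weka output above.
WEIGHTS = {
    ("CATEG", "MULTIPARA"): -0.1952,
    ("CATEG", "PRIMIPARA"): 0.1182,
    ("CATEG", "NOVILHA"): 0.1953,
    ("RACATO", "NELORE"): -0.0599,
}
INTERCEPT = 0.4637


def linear_score(instance):
    """Intercept plus the weight of every indicator that equals 1."""
    # An attribute=value pair without a listed weight (e.g. RACATO=ANGUS)
    # contributes 0, which is the same as its indicator being 0.
    return INTERCEPT + sum(WEIGHTS.get(pair, 0.0) for pair in instance.items())


def probability(instance):
    """Eq2 as written in this post: 1 / (1 + Exp(score))."""
    return 1.0 / (1.0 + math.exp(linear_score(instance)))


for categ in ("MULTIPARA", "PRIMIPARA", "NOVILHA"):
    for racato in ("NELORE", "ANGUS"):
        inst = {"CATEG": categ, "RACATO": racato}
        print(categ, racato, linear_score(inst), probability(inst))
```

Each instance switches on exactly one CATEG indicator, so only one of the three CATEG weights enters the sum.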

In R, we can do logistic regression, and the results appear in the form of Eq1 rather than Weka's Eq2, where nominal attributes are binarized.

Is there a way to make the regression in Weka behave like R? The binarized form is easy to interpret when a nominal attribute has 3 possible values, but it gets messy when an attribute has 87 possible values. In my case, I have a couple more attributes, with 87 and 20 possible values, and the output becomes hard to read.
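
For comparison, a sketch of how the two layouts relate. R's default treatment contrasts code a factor with k levels as k-1 indicators plus a reference level absorbed into the intercept. The helper below (a hypothetical name, not an R or Weka function) rewrites the CATEG weights from the table in that form. It reproduces the same linear scores as the Weka table; it is not a claim about what R would estimate from the data, since the fitting procedures may differ.

```python
def to_reference_coding(intercept, level_weights, reference):
    """Fold the reference level's weight into the intercept.

    Returns (new_intercept, new_weights), where new_weights has one entry
    per non-reference level, expressed relative to the reference level.
    """
    base = level_weights[reference]
    new_weights = {
        level: w - base
        for level, w in level_weights.items()
        if level != reference
    }
    return intercept + base, new_weights


# CATEG weights and intercept copied from the Weka output above.
categ = {"MULTIPARA": -0.1952, "PRIMIPARA": 0.1182, "NOVILHA": 0.1953}
new_intercept, new_categ = to_reference_coding(0.4637, categ, "MULTIPARA")
print(new_intercept, new_categ)
```

This still leaves k-1 coefficients per nominal attribute, so an attribute with 87 values remains long in either layout.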

Cheers,

Luisa






_______________________________________________
Wekalist mailing list -- [hidden email]
Send posts to [hidden email]
To unsubscribe send an email to [hidden email]
To subscribe, unsubscribe, etc., visit https://list.waikato.ac.nz/postorius/lists/wekalist.list.waikato.ac.nz
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: Logistic Regression equation - binarization of nominal attributes

Eibe Frank-2
Administrator
Yes, you are right, Logistic in WEKA binarizes nominal attributes. You can see exactly what happens by running the unsupervised NominalToBinary filter on your data, e.g., in the Preprocess panel.

Coding the nominal values as integers 1, 2, and 3 seems problematic when using logistic regression because it assumes interval scale. However, there is an “OrdinalToNumeric” filter that you can apply to achieve this.
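
To illustrate the interval-scale assumption, a small Python sketch: with a single weight and the codes 1, 2, 3, the linear score moves by the same amount between consecutive categories, whereas the indicator weights reported in the question are not evenly spaced. The value of w1 below is arbitrary and only for illustration; the code assignment is the one proposed in the question.

```python
# Integer coding proposed in the question.
CODES = {"MULTIPARA": 1, "PRIMIPARA": 2, "NOVILHA": 3}


def score(w0, w1, category):
    """Linear score with CATEG treated as a single numeric attribute."""
    return w0 + w1 * CODES[category]


w0, w1 = 0.4637, 0.2  # w1 is an arbitrary illustrative value
step_12 = score(w0, w1, "PRIMIPARA") - score(w0, w1, "MULTIPARA")
step_23 = score(w0, w1, "NOVILHA") - score(w0, w1, "PRIMIPARA")
print("integer coding steps:", step_12, step_23)  # both equal w1 up to rounding

# Gaps between the indicator weights reported in the question.
fitted = {"MULTIPARA": -0.1952, "PRIMIPARA": 0.1182, "NOVILHA": 0.1953}
gap_12 = fitted["PRIMIPARA"] - fitted["MULTIPARA"]
gap_23 = fitted["NOVILHA"] - fitted["PRIMIPARA"]
print("indicator weight gaps:", gap_12, gap_23)
```

A single weight cannot represent two different gaps, which is the restriction that integer coding imposes.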

It might be better to merge nominal values instead. There is a supervised MergeNominalValues filter that applies a greedy method based on the chi-squared test to merge categories of nominal attributes. Just make sure you apply this *supervised* filter as part of the FilteredClassifier to avoid optimistic performance estimates.

Cheers,
Eibe


