How to get the p-Values from the logistic regression in weka


How to get the p-Values from the logistic regression in weka

Steffen Albrecht

Hi everyone,

Is it possible to get the p-value(s) of the logistic regression implemented in WEKA (weka.classifiers.functions.Logistic)? I searched for this topic but did not find a solution. What I want to do is this (example using R):

--------------------------------------------------------------------------------

glm(formula = "class ~ attribute1", family = binomial("logit"), data = d)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.24744  -1.20809   0.06943   1.12129   1.35639

Coefficients:
              Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)   0.171438    0.786038    0.218     0.827
attribute1   -0.007521    0.020550   -0.366     0.714

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13.863 on 9 degrees of freedom
Residual deviance: 13.726 on 8 degrees of freedom
AIC: 17.726

Number of Fisher Scoring iterations: 4
--------------------------------------------------------------------------------

In this example I built a generalized linear model with "logit" as the link function. I also used this model as a classifier and found that the AUC of the glm in R is exactly the same as the AUC of the Logistic classifier in WEKA, so I think the implementations are very similar. I am interested in the p-value from the logistic regression for a single attribute, using just this attribute and the class attribute. This is the Pr(>|z|) entry in the attribute1 row (marked orange in the original message). I want to use the p-value in a significance test to filter attributes. The z-value would also be interesting.
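
For orientation, the z value and Pr(>|z|) columns in the summary above follow from the first two columns: z is the estimate divided by its standard error, and the p-value is the two-sided tail probability of z under a standard normal distribution. The following is a minimal plain-Java sketch of that arithmetic only. The class and method names are made up for illustration, and the normal CDF uses a textbook polynomial approximation of erf (Abramowitz and Stegun 7.1.26) rather than any WEKA or R routine:

```java
final class WaldPValue {

    // Standard normal CDF, based on the Abramowitz-Stegun 7.1.26
    // approximation of erf (absolute error of roughly 1.5e-7).
    static double normalCdf(double z) {
        double x = Math.abs(z) / Math.sqrt(2.0);
        double t = 1.0 / (1.0 + 0.3275911 * x);
        double poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
                + t * (-1.453152027 + t * 1.061405429))));
        double erf = 1.0 - poly * Math.exp(-x * x);
        return z >= 0 ? 0.5 * (1.0 + erf) : 0.5 * (1.0 - erf);
    }

    // Wald z statistic: coefficient estimate divided by its standard error.
    static double zValue(double estimate, double stdError) {
        return estimate / stdError;
    }

    // Two-sided p-value for a z statistic.
    static double twoSidedP(double z) {
        return 2.0 * (1.0 - normalCdf(Math.abs(z)));
    }

    public static void main(String[] args) {
        // Estimate and standard error of attribute1 from the R summary above.
        double z = zValue(-0.007521, 0.020550);
        System.out.println("z = " + z + ", p = " + twoSidedP(z));
    }
}
```

Fed with the attribute1 row, this should reproduce the z value and p-value of the summary up to rounding. The hard part, obtaining the standard error itself, is not covered by this sketch.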

Until now I have used glm in R, as shown in the example. The problem is that there are a lot of attributes to analyze and the function in R is not that fast. Furthermore, the R script I implemented is no longer stable on our HPC system (two months ago I had no problems with it). Another motivation is that I would like to use only Java for all my programs.

So I would be glad to get an answer to my question, "How to get the p-values from the logistic regression in WEKA"!
I would also be very happy to get a short explanation rather than just the formula (if possible).

With kind regards

Steffen Albrecht (from Germany)


_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
List info and subscription status: https://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: How to get the p-Values from the logistic regression in weka

Eibe Frank-2
Administrator
Here is some Groovy code for WEKA 3.8 (easily translatable to full Java) that computes the p-values for all coefficients of a logistic regression model:

=================

// Compute statistics for coefficients of logistic regression model,
// for two-class data with numeric predictors and no missing values

import weka.core.Utils

import no.uib.cipr.matrix.DenseMatrix
import no.uib.cipr.matrix.UpperSymmDenseMatrix
import no.uib.cipr.matrix.Matrices

datasource = new weka.core.converters.ConverterUtils.DataSource("/Users/eibe/datasets/UCI/diabetes.arff")
data = datasource.getDataSet()
data.setClassIndex(data.numAttributes() - 1)

classifier = new weka.classifiers.functions.Logistic()
classifier.buildClassifier(data)

coefficients = classifier.coefficients()

// Store number of instances and number of coefficients
// (intercept plus one per predictor, which equals numAttributes
// because the class attribute is not a predictor)
n = data.numInstances()
m = data.numAttributes()

// Establish required matrices: design matrix X (with a leading column
// of ones for the intercept) and diagonal weight matrix V
X = new DenseMatrix(n, m)
V = new UpperSymmDenseMatrix(n)
for (int i = 0; i < n; i++) {
  // Estimated probability of the first class value for instance i
  p = classifier.distributionForInstance(data.instance(i))[0]
  V.set(i, i, p * (1 - p))
  index = 0
  X.set(i, index++, 1.0)
  for (int j = 0; j < m; j++) {
    if (j != data.classIndex()) {
      X.set(i, index++, data.instance(i).value(j))
    }
  }
}

// Compute M = X'VX
M = X.transpose(new DenseMatrix(m, n)).mult(V, new DenseMatrix(m, n)).mult(X, new UpperSymmDenseMatrix(m))

// Compute covariance matrix for parameters (inverse of M)
I = Matrices.identity(m)
C = I.copy()
C = M.solve(I, C)

println("\tEstimate\t\tStd. Error\tz value\tPr(>|z|)")
for (int j = 0; j < m; j++) {
  if (j == 0) {
    print("Interc.")
  } else {
    // Coefficient j belongs to attribute j - 1, because index 0 is the
    // intercept (this relies on the class being the last attribute)
    print(data.attribute(j - 1).name())
  }
  c = coefficients[j][0]
  e = Math.sqrt(C.get(j, j))  // standard error
  z = c / e                   // Wald z statistic
  p = 2.0 * (1.0 - weka.core.Statistics.normalProbability(Math.abs(z)))  // two-sided p-value
  print("\t" + Utils.doubleToString(c, 7))
  print("\t" + Utils.doubleToString(e, 7))
  print("\t" + Utils.doubleToString(z, 3))
  print("\t" + Utils.doubleToString(p, 6))
  println()
}

=================

Here is the output from running this code in WEKA's Groovy console:

        Estimate                Std. Error      z value Pr(>|z|)
Interc. 8.4046802               0.7166351       11.728  0
preg    -0.1231818              0.0320775       -3.84   0.000123
plas    -0.0351637              0.0037087       -9.481  0
pres    0.0132955               0.0052336       2.54    0.011072
skin    -0.000619               0.0068994       -0.09   0.928514
insu    0.0011917               0.0009012       1.322   0.186063
mass    -0.0897007              0.0150876       -5.945  0
pedi    -0.9451775              0.2991473       -3.16   0.00158
age     -0.014869               0.0093348       -1.593  0.111192

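And here is the result from doing the same in R (the signs are reversed because WEKA's Logistic reports the coefficients for the first class value, whereas glm models the second one):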

> library(RWeka)
> d = read.arff("/Users/eibe/datasets/UCI/diabetes.arff")                                                                                                                          
> l = glm(formula = "class ~ .", family = binomial("logit"), data = d)
> summary(l)

Call:
glm(formula = "class ~ .", family = binomial("logit"), data = d)

Deviance Residuals:
    Min       1Q   Median       3Q      Max  
-2.5566  -0.7274  -0.4159   0.7267   2.9297  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -8.4046964  0.7166359 -11.728  < 2e-16 ***
preg         0.1231823  0.0320776   3.840 0.000123 ***
plas         0.0351637  0.0037087   9.481  < 2e-16 ***
pres        -0.0132955  0.0052336  -2.540 0.011072 *  
skin         0.0006190  0.0068994   0.090 0.928515    
insu        -0.0011917  0.0009012  -1.322 0.186065    
mass         0.0897010  0.0150876   5.945 2.76e-09 ***
pedi         0.9451797  0.2991475   3.160 0.001580 **
age          0.0148690  0.0093348   1.593 0.111192    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 993.48  on 767  degrees of freedom
Residual deviance: 723.45  on 759  degrees of freedom
AIC: 741.45

Number of Fisher Scoring iterations: 5
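
The standard errors in the Groovy script are the square roots of the diagonal of the inverse of M = X'VX, where V holds p * (1 - p) for each instance. For readers who want to see that step without the MTJ matrix classes, here is a dependency-free Java sketch. It is an illustration only: the class and method names are hypothetical, the fitted probabilities are assumed to be given, and the inverse is computed with plain Gauss-Jordan elimination instead of the MTJ solve call used above.

```java
final class WaldStdErrors {

    // M = X'VX with V = diag(p_i * (1 - p_i)). Each row x[i] must already
    // contain the leading 1.0 for the intercept.
    static double[][] information(double[][] x, double[] p) {
        int m = x[0].length;
        double[][] info = new double[m][m];
        for (int i = 0; i < x.length; i++) {
            double w = p[i] * (1.0 - p[i]);
            for (int r = 0; r < m; r++) {
                for (int c = 0; c < m; c++) {
                    info[r][c] += w * x[i][r] * x[i][c];
                }
            }
        }
        return info;
    }

    // Inverse of a small square matrix by Gauss-Jordan elimination
    // with partial pivoting.
    static double[][] invert(double[][] a) {
        int m = a.length;
        double[][] aug = new double[m][2 * m];
        for (int r = 0; r < m; r++) {
            System.arraycopy(a[r], 0, aug[r], 0, m);
            aug[r][m + r] = 1.0;
        }
        for (int col = 0; col < m; col++) {
            int pivot = col;
            for (int r = col + 1; r < m; r++) {
                if (Math.abs(aug[r][col]) > Math.abs(aug[pivot][col])) pivot = r;
            }
            if (Math.abs(aug[pivot][col]) < 1e-12) {
                throw new ArithmeticException("matrix is singular");
            }
            double[] tmp = aug[col]; aug[col] = aug[pivot]; aug[pivot] = tmp;
            double d = aug[col][col];
            for (int k = 0; k < 2 * m; k++) aug[col][k] /= d;
            for (int r = 0; r < m; r++) {
                if (r == col) continue;
                double f = aug[r][col];
                for (int k = 0; k < 2 * m; k++) aug[r][k] -= f * aug[col][k];
            }
        }
        double[][] inv = new double[m][m];
        for (int r = 0; r < m; r++) System.arraycopy(aug[r], m, inv[r], 0, m);
        return inv;
    }

    // Standard error of coefficient j is sqrt of entry (j, j) of inverse(M).
    static double[] standardErrors(double[][] x, double[] p) {
        double[][] cov = invert(information(x, p));
        double[] se = new double[cov.length];
        for (int j = 0; j < se.length; j++) se[j] = Math.sqrt(cov[j][j]);
        return se;
    }

    public static void main(String[] args) {
        // Toy design matrix (intercept + one predictor) with made-up probabilities.
        double[][] x = {{1, 0}, {1, 1}, {1, 2}, {1, 3}};
        double[] p = {0.2, 0.4, 0.6, 0.8};
        System.out.println(java.util.Arrays.toString(standardErrors(x, p)));
    }
}
```

Accumulating M row by row avoids building the n-by-n matrix V, which may matter for large datasets. Dividing each coefficient by the matching standard error then gives the z value.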


Cheers,
Eibe

> On 15 Jun 2016, at 00:16, Steffen Albrecht <[hidden email]> wrote:
>
> Hi everyone,
> the question is whether it is possible to get the p-Value(s) of the logistic regression implemented in weka (weka.classifiers.functions.Logistic). [...]


Re: How to get the p-Values from the logistic regression in weka

Saeed
In reply to this post by Steffen Albrecht
Hi,
I've been using WEKA for a while and have built some models based on the WEKA classifier libraries.
I saw the thread linked below, which explains how to calculate p-values to establish the statistical significance of the model. When I run the code on a Linux machine, however, I am told that it is deprecated, and the matrix multiplications and the solve function fail with the errors below. Could you please help?
https://weka.8497.n7.nabble.com/How-to-get-the-p-Values-from-the-logistic-regression-in-weka-td37743.html
I'm using the following jars:
 netlib-java-1.1.jar
 arpack_combined_all-0.1.jar
 core-1.1.jar
 
 
 converting the train directory into Instances object (WEKA)
 building decision tree classifier
        Estimate                Std. Error      z value Pr(>|z|)
 Interc.
 Exception in thread "main" java.lang.NoClassDefFoundError: org/netlib/lapack/LAPACK
        at no.uib.cipr.matrix.AbstractSymmDenseMatrix.solve(AbstractSymmDenseMatrix.java:222)
        at no.uib.cipr.matrix.UpperSymmDenseMatrix.solve(UpperSymmDenseMatrix.java:30)
        at javaapplication4.application.Classifier.build(Classifier.java:149)
        at javaapplication4.UrlCSV.processcsv(UrlCSV.java:582)
        at javaapplication4.UrlCSV.main(UrlCSV.java:4274)
 Caused by: java.lang.ClassNotFoundException: org.netlib.lapack.LAPACK
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 5 more
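
The trace above says that the JVM could not load the class org.netlib.lapack.LAPACK when UpperSymmDenseMatrix.solve was called, so none of the jars on the runtime classpath appears to provide a class with that exact name. One possible cause, and this is only an assumption, is that the matrix library jar and the netlib-java jar come from incompatible versions. A small Java check like the following (the class name ClasspathCheck is made up for illustration) can confirm which of the expected classes are actually visible at runtime:

```java
final class ClasspathCheck {

    // Returns true if the named class can be located by the class loader
    // that loaded this check; static initializers are not run.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "org.netlib.lapack.LAPACK",        // class named in the stack trace
            "no.uib.cipr.matrix.DenseMatrix"   // matrix class used by the Groovy script
        };
        for (String name : names) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

Run it with the same classpath as the failing application. If the first class is reported as missing, listing the contents of the netlib jar shows which package names it really contains.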


