Ranking measures in WEKA

johannes maucher
Hi,
I recently tried to find out which heuristic is used in the Weka evaluator CfsSubsetEval. For this I consulted Mark Hall's PhD thesis. Three variations are mentioned there: for the correlations in formula 4.16, either symmetric uncertainty, Relief, or MDL can be applied. But I still did not know which one is applied in Weka. By calculating it manually I found out that it is symmetric uncertainty, i.e. exactly the CFS heuristic formula printed on page 292, chapter 7.1, of the Data Mining book by Witten and Frank (2nd edition). For numeric features, MDL discretization seems to be applied internally.
Can anybody confirm this?

Thanks and best regards
Johannes
_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
List info and subscription status: https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: Ranking measures in WEKA

mhall
Administrator
This is correct. Fayyad and Irani's method is used to discretize numeric
features before computing the heuristic (using symmetrical uncertainty)
when the class is discrete. If the class is numeric, then standard
Pearson's correlation is used in the merit formula. In this case, any
nominal attributes are effectively converted to binary indicator
attributes and a weighted Pearson's is used when computing their
correlation to the class and intercorrelation with other attributes
(this is not in the thesis but is discussed in my ICML paper on CFS).

Cheers,
Mark.
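
For readers who want to reproduce the manual check mentioned above, here is a minimal sketch of symmetrical uncertainty between two nominal variables, using the usual definition SU(X, Y) = 2 * (H(X) + H(Y) - H(X, Y)) / (H(X) + H(Y)). It is not Weka's code; the function names and the toy data are made up for illustration.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of nominal values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * (H(X) + H(Y) - H(X, Y)) / (H(X) + H(Y)), in [0, 1]."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0.0:
        return 0.0  # both variables are constant; treated as uncorrelated here
    h_xy = entropy(list(zip(x, y)))  # joint entropy of the paired values
    return 2.0 * (h_x + h_y - h_xy) / (h_x + h_y)


# Made-up toy data (an assumption, not taken from the thesis or from Weka).
feature = ["a", "a", "b", "b"]
same = ["yes", "yes", "no", "no"]          # fully determined by the feature
unrelated = ["yes", "no", "yes", "no"]     # independent of the feature

su_same = symmetrical_uncertainty(feature, same)
su_unrelated = symmetrical_uncertainty(feature, unrelated)
print(su_same, su_unrelated)
```

A value near 1 means one variable predicts the other almost completely; a value near 0 means they look independent.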



Predictions

Elchin Julfayev
Hello,
I am just a Java programmer, and for our research I have to make some predictions.
For this I created a training set and ran classification with 10-fold cross-validation from the CLI, using AdaBoostM1 with 1000 iterations:

 cmd1 = "java -Xms64m -Xmx2048m -XX:-UseGCOverheadLimit weka.classifiers.meta.AdaBoostM1 -W weka.classifiers.trees.DecisionStump -S 1 -I 1000 -i -t myTraining.arff -d myModel.model"

As the class variable I used a variable (say, myClass) that takes the values 0 or 1, and I am classifying over the other 15 variables (attributes).
During classification I also output the true positive, false positive, and ROC AUC values.

For predictions I am using CLI command:

cmd2 = "java weka.classifiers.meta.AdaBoostM1 -l myModel.model -T myTest.arff -p 0"

The format (attributes) of myTest.arff is the same as that of myTraining.arff.
I have 2 questions and would be very grateful if you could tell me something about them:

1. I heard that in the test .arff file I can't put class values for the class variable (myClass). Is it correct that I have to put a "?" mark there instead of the 0 or 1 that I used?

2. How can I use the true positive, false positive, and ROC AUC values when calculating prediction probabilities?

My prediction probabilities are mostly equal to 1, which is very suspicious.
Regards,
Elchin Julfayev


datasets from uci

Nancy Adam

Hi,

I ran regression on the attached datasets (from UCI) and got large RMSE values (33 and 44).

My understanding is that RMSE should not be a large number. Is that correct?

Could you please run regression on them with any of Weka's regression algorithms, so that I know whether I'm right or not?

Thanks,

Nancy

Attachments: wisconsin 33.txt (56K), pollution 16.txt (5K)

Re: datasets from uci

Harri Saarikoski-2


2010/2/5 Nancy Adam <[hidden email]>

> Hi,
>
> I ran regression on the attached datasets (from UCI) and got large RMSE values (33 and 44).
>
> My understanding is that RMSE should not be a large number. Is that correct?
>
> Could you please run regression on them with any of Weka's regression algorithms, so that I know whether I'm right or not?

Hmm, I really don't think that's how it works here...

A question should be more specific, and you should first do the work yourself (on multiple classifiers). It should also be preceded by a conveyed understanding of basic things like "(yes) RMSE should be as low as possible" (it being an error measure).

-> The wider machine learning community has proper forums for such questions, if you want them properly answered.
Hari

--
-----------------
Harri M.T. Saarikoski
M.A, PhD graduate student
Helsinki University
Finland


Re: datasets from uci

Marcin Wojnarski
In reply to this post by Nancy Adam
Nancy,

RMSE values for Wisconsin Breast Cancer data set can be found at TunedIT:

http://tunedit.org/results?e=regression&d=wisconsin

Predictions were made by different general-purpose algorithms from Weka.
These values are indeed in the range of 30-35. You can check the exact
contents of the TunedIT version of this data at:

http://tunedit.org/repo/UCI/numeric/wisconsin.arff

In TunedIT results, the _last_ column, "time", is the predicted one.

Regards
Marcin
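
As a side note on why an RMSE of 30-35 is not "big" in itself: RMSE is expressed in the units of the predicted attribute, so it only means something relative to the scale of that attribute. The sketch below uses made-up numbers (not the Wisconsin or pollution data) to show that rescaling the target by 10 rescales the RMSE by 10 as well.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Made-up target values and predictions (an assumption for illustration only).
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

small_scale = rmse(actual, predicted)

# The same data expressed in units that are 10 times smaller,
# so every number becomes 10 times larger.
large_scale = rmse([10 * a for a in actual], [10 * p for p in predicted])

print(small_scale, large_scale)
```

Comparing the RMSE with the range or standard deviation of the target (or with a trivial predictor such as the mean) is therefore more informative than looking at the raw number.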


--
Marcin Wojnarski, Project Lead, TunedIT
tel.: +48 22 662 31 96
http://tunedit.org

Machine Learning & Data Mining Research -
Automated Tests, Repeatable Experiments, Meaningful Results




Best algorithms for text classification

Piotrek-12
Hi,

1. What are the best algorithms for text classification in Weka? I've
found Naive Bayes, SVMreg and KNN, but KNN doesn't return any P, R, or F1
measures...
2. How can I compare the results of KNN and Naive Bayes?
3. Is there any implementation of the Rocchio algorithm in Weka?

Thanks in advance for help.

Regards,
Pasi



RE: datasets from uci

Nancy Adam
In reply to this post by Marcin Wojnarski
Hi Marcin,
Thank you so much for your help. It is really very helpful.

Nancy



Re: Ranking measures in WEKA

Abdrahman0x
In reply to this post by mhall

What if the class is nominal and the attributes are numeric? I think
Pearson correlation can be used to find the relation between the attributes,
but how can we then find the relation between the attributes and the class?
I hope my question is clear.

Is there an alternative way inside Weka to do so?

Thanks




Re: Ranking measures in WEKA

Mark Hall


On 16/10/18, 10:01 PM, "Abdrahman0x" <[hidden email] on behalf of [hidden email]> wrote:

    What if the class is nominal and the attributes are numeric? I think
    Pearson correlation can be used to find the relation between the attributes,
    but how can we then find the relation between the attributes and the class?
    I hope my question is clear.

In this case, CFS discretizes numeric attributes (using the MDL-based supervised method of Fayyad and Irani) and then computes all correlation scores using the information-theoretic "symmetrical uncertainty" measure.

    Is there an alternative way inside Weka to do so?

There is not currently an option that allows Pearson's correlation to be used in the case when the class is nominal. Although it seems like a million years ago now, I'm pretty sure I tried this at the time I was working on my thesis and it did not give as good results with common learning algorithms as symmetrical uncertainty did.

Cheers,
Mark.
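
To make the "discretize, then symmetrical uncertainty" idea concrete, here is a rough sketch. Note the simplification: it picks a single entropy-minimising cut point per attribute, whereas the Fayyad and Irani method splits recursively and uses an MDL criterion to decide when to stop, so this is not what Weka actually does. The helper names and the toy data are made up.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of nominal values."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_binary_cut(values, labels):
    """Cut point minimising the weighted class entropy of the two sides.

    Simplified stand-in for supervised discretization: one cut only,
    no recursion and no MDL stopping rule. Returns None if all values are equal.
    """
    pairs = sorted(zip(values, labels))
    best_cut, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between identical values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2.0
            best_score = score
    return best_cut


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * (H(X) + H(Y) - H(X, Y)) / (H(X) + H(Y))."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0.0:
        return 0.0
    return 2.0 * (h_x + h_y - entropy(list(zip(x, y)))) / (h_x + h_y)


# Made-up numeric attribute and nominal class (an assumption for illustration).
temperature = [64.0, 65.0, 68.0, 70.0, 80.0, 83.0, 85.0, 90.0]
play = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]

cut = best_binary_cut(temperature, play)
binned = ["low" if t <= cut else "high" for t in temperature]
r_cf = symmetrical_uncertainty(binned, play)  # attribute-class correlation
print(cut, r_cf)
```

The same function can be applied to two discretized attributes to get an attribute-attribute score.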
 



Re: Ranking measures in WEKA

Abdrahman0x
Thank you, Mark, for your response, but I am still a little confused.

In your answer you mentioned that CFS discretizes the numeric
attributes. In my dataset the attributes are numeric and the only
issue is the class attribute, which is nominal. I could apply
Pearson correlation to compute the correlation between the numeric
attributes (I used Data Analysis in MS Excel for this purpose); the only
issue is how to compute the correlation with the nominal class attribute.
In your thesis I found a good example (Table 4.2 and Table 4.3), as
your "Golf" dataset is similar to my work (it has a nominal class
attribute).

*(Question 1)* In Table 4.2 (page 72), you computed the correlations
between the attributes, and between each attribute and the class, using
"Relief", but when I applied the Relief algorithm in Weka to my dataset, I
was confused by the output. Can you explain how you got the values in the
class column of Table 4.2 using Relief? If you don't mind, please explain
the calculation steps that give the values (0.130, 0.025, 0.185, 0.081)
in the table.

*(Question 2)* In Table 4.3 (page 73), I understand the (rff) column when
it is computed between two attributes, as calculated in Table 4.3, but I
could not understand the value when it is computed between 3 attributes.
For example, for [Outlook Temperature Humidity], why is the value 0.132?
Where did you get this value from?

*(Question 3)* In Table 4.3 (page 73), for the (rff) correlation
between [Temperature Humidity] you have written 0.258. I think it is
supposed to be 0.248, as shown in Table 4.2 (page 72). Am I right or
wrong? Can you please explain?

I am sorry for the 3 long questions in my post, but I am still a beginner and
would like to learn. I would appreciate your patient support.

Thank you so much in advance for your patience and for your support.

Many thanks,
Abdrahman




Re: Ranking measures in WEKA

Mark Hall
Hi Abdrahman,

The Golf/weather data used for the tables is the all nominal version (weather.nominal.arff, as included with the Weka distribution). So, these examples are not operating with numeric attributes.

In answer to your first question: the relief calculation used in this table is the version that assumes attribute independence. It is given on page 58, and uses the Gini' index. It applies to nominal attributes only. You can also take a look at Kononenko's paper on Relief that describes this metric:

https://link.springer.com/content/pdf/10.1007%2F3-540-57868-4_57.pdf

You should be able to use the contingency tables on page 11, along with the formulas on page 58, to compute the values in Table 4.2.

As for your second (and third) question: there is actually a typo in Table 4.2 (I think); the correlation between temperature and humidity should be 0.258. The calculation for the intercorrelation between outlook, temperature and humidity is (0.116 + 0.022 + 0.258) / 3 = 0.132.

Hope this helps!

Cheers,
Mark.
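
The arithmetic above can be checked in a few lines, and the result plugged into the commonly cited CFS merit, Merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff), where r_cf and r_ff are the average feature-class and feature-feature correlations. In this sketch the three pairwise values are the ones quoted in the message; the r_cf value is a placeholder (an assumption, not a number from the thesis), and the function names are made up.

```python
from math import sqrt


def average(values):
    """Plain arithmetic mean of a list of correlation scores."""
    return sum(values) / len(values)


def cfs_merit(k, avg_r_cf, avg_r_ff):
    """Merit of a k-feature subset from its average correlations."""
    return k * avg_r_cf / sqrt(k + k * (k - 1) * avg_r_ff)


# Pairwise feature-feature correlations quoted in the message above
# for the subset [Outlook Temperature Humidity].
pairwise_rff = [0.116, 0.022, 0.258]
avg_rff = average(pairwise_rff)
print(round(avg_rff, 3))  # 0.132

# PLACEHOLDER average feature-class correlation, chosen only to show the call.
placeholder_rcf = 0.1
merit = cfs_merit(k=3, avg_r_cf=placeholder_rcf, avg_r_ff=avg_rff)
print(merit)
```

With a fixed r_cf, a larger average intercorrelation r_ff lowers the merit, which is how the heuristic penalises redundant features.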


Re: Ranking measures in WEKA

Abdrahman0x
Thank you so much, Mark, for your reply and for your clarification.

There is one thing that I still couldn't resolve: how to find (rcf), the
correlation between features and the class, in the case of numeric features
and a nominal class. I tried to understand the application of Fayyad and
Irani discretization as discussed, but couldn't.

I need a way to understand the calculation. I would appreciate it if you
could provide me with a step-by-step example or a good reference with
examples of the calculation.

Thank you
Abdrahman




Re: Ranking measures in WEKA

Abdrahman0x
Hi,

In the thesis, page 73, Table 4.3: how did you get the rff for
[Outlook Temperature Humidity Wind]?

I tried the following, but the answer was different:
[0.116+0.022+0.007+0.258+0.028]/4 = 0.108, which differs from your
answer (0.0718).

Note that I did the calculation in the same way as for the rff of
3 attributes (where the sum is divided by 3).

Can you please clarify?

Also, if you don't mind, could you give me some clarification on my previous
post regarding the correlation between numeric attributes and a nominal class?

Thank you,
Abdrahman
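
One possible reading of the discrepancy, which is not confirmed anywhere in this thread: the average is taken over pairs of attributes, not over attributes. Three attributes happen to have 3 pairs, but four attributes have 4 * 3 / 2 = 6 pairs, while only five values are listed above. The sketch below assumes the sixth pairwise value is 0 (an assumption; the post does not give it) and divides by 6.

```python
from itertools import combinations

features = ["Outlook", "Temperature", "Humidity", "Wind"]

# Every unordered pair of distinct features contributes one r_ff term.
pairs = list(combinations(features, 2))
n_pairs = len(pairs)

# The five pairwise values listed in the post above.
listed = [0.116, 0.022, 0.007, 0.258, 0.028]

# ASSUMPTION: the sixth pairwise correlation is 0.0; the post does not state it.
values = listed + [0.0]

avg_rff = sum(values) / n_pairs
print(n_pairs, round(avg_rff, 4))  # 6 0.0718
```

Under that assumption the result matches the 0.0718 quoted from Table 4.3, whereas dividing the same sum by 4 gives the 0.108 obtained in the post.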


