Possible error with M5Rules?


Possible error with M5Rules?

achar
Hello,

I am using M5Rules to predict the "Period (Sec)", i.e. the 6th column in my
data. I have written a parser in .NET to convert the rules to Excel
formulas, but when I apply the formula in Excel I get completely erroneous
results. The formula seems correct, and I am afraid there's something
fundamental that I do not grasp. I have set minNumInstances = 200 and
trained on the whole set to produce as few rules as possible for debugging
purposes. Here's the output:

=== Run information ===

Scheme:       weka.classifiers.rules.M5Rules -M 200.0
Relation:     data
Instances:    4026
Attributes:   6
              Number of Storeys
              Number of Spans
              Length of Spans (m)
              Opening percentage (%)
              Masonry wall Stiffeness Et (x10^5 kN/m)
              Period (Sec)
Test mode:    evaluate on training data

=== Classifier model (full training set) ===

M5 pruned model rules
(using smoothed linear models) :
Number of Rules : 8

Rule: 1
IF
        Number of Storeys > 9.5
        Opening percentage (%) > 37.5
        Number of Storeys > 15.5
THEN

Period (Sec) =
        0.1215 * Number of Storeys
        - 0.0883 * Number of Spans
        + 0.243 * Length of Spans (m)
        + 0.0002 * Opening percentage (%)
        - 0.0002 * Masonry wall Stiffeness Et (x10^5 kN/m)
        - 0.9072 [882/29.044%]

Rule: 2
IF
        Number of Storeys > 6.5
        Opening percentage (%) <= 62.5
THEN

Period (Sec) =
        0.0532 * Number of Storeys
        + 0.0016 * Length of Spans (m)
        + 0.0108 * Opening percentage (%)
        - 0.0233 * Masonry wall Stiffeness Et (x10^5 kN/m)
        + 0.1191 [1104/26.789%]

Rule: 3
IF
        Number of Storeys > 6.5
        Number of Storeys > 10.5
THEN

Period (Sec) =
        0.1259 * Number of Storeys
        + 0.1856 * Length of Spans (m)
        + 0 * Opening percentage (%)
        - 1.0226 [523/8.941%]

Rule: 4
IF
        Number of Storeys > 6.5
THEN

Period (Sec) =
        0.1222 * Number of Storeys
        + 0.113 * Length of Spans (m)
        + 0.0001 * Opening percentage (%)
        - 0.6161 [419/11.291%]

Rule: 5
IF
        Number of Storeys > 3.5
        Opening percentage (%) > 37.5
THEN

Period (Sec) =
        0.1229 * Number of Storeys
        + 0.061 * Length of Spans (m)
        + 0.0002 * Opening percentage (%)
        - 0.3746 [378/32.319%]

Rule: 6
IF
        Number of Storeys > 1.5
        Opening percentage (%) <= 62.5
THEN

Period (Sec) =
        0.0481 * Number of Storeys
        + 0.0008 * Length of Spans (m)
        + 0.0026 * Opening percentage (%)
        + 0.0041 [327/64.783%]

Rule: 7
IF
        Number of Storeys > 1.5
THEN

Period (Sec) =
        0.0071 * Number of Storeys
        + 0.0219 * Length of Spans (m)
        + 0.2009 [210/26.645%]

Rule: 8

Period (Sec) =
        + 0.1423 [183/100%]



Time taken to build model: 0.14 seconds

=== Evaluation on training set ===

Time taken to test model on training data: 0.01 seconds

=== Summary ===

Correlation coefficient                  0.9856
Mean absolute error                      0.0824
Root mean squared error                  0.1326
Relative absolute error                 12.7376 %
Root relative squared error             16.896  %
Total Number of Instances             4026    


The correlation coefficient is rather high. In Excel the corresponding data
are in the following columns:

              A: Number of Storeys
              B: Number of Spans
              C: Length of Spans (m)
              D: Opening percentage (%)
              E: Masonry wall Stiffeness Et (x10^5 kN/m)
             
In column F, row 2, I evaluate the Period (Sec) using the formula
=IF(AND(A2>9.5,D2>37.5,A2>15.5),0.1215*A2-0.0883*B2+0.243*C2+0.0002*D2-0.0002*E2-0.9072,IF(AND(A2>6.5,D2<=62.5),0.0532*A2+0.0016*C2+0.0108*D2-0.0233*E2+0.1191,IF(AND(A2>6.5,A2>10.5),0.1259*A2+0.1856*C2+0*D2-1.0226,IF(A2>6.5,0.1222*A2+0.113*C2+0.0001*D2-0.6161,IF(AND(A2>3.5,D2>37.5),0.1229*A2+0.061*C2+0.0002*D2-0.3746,IF(AND(A2>1.5,D2<=62.5),0.0481*A2+0.0008*C2+0.0026*D2+0.0041,IF(A2>1.5,0.0071*A2+0.0219*C2+0.2009,0.1423)))))))
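
One way to cross-check a nested formula like this is to evaluate the same rules outside Excel. The sketch below is a direct Python transcription of the eight rules printed above, read as an ordered decision list (the first rule whose conditions all hold is used). The function name and argument names are made up for illustration, and the coefficients are the rounded ones from the text output, so small differences from WEKA's own predictions are to be expected.

```python
def predict_period(storeys, spans, span_len, opening, stiffness):
    """Evaluate the printed M5Rules output as an ordered decision list.

    Arguments correspond to Excel columns A..E. Coefficients are copied
    from the (rounded) text output, not from the model object itself.
    """
    if storeys > 9.5 and opening > 37.5 and storeys > 15.5:      # Rule 1
        return (0.1215 * storeys - 0.0883 * spans + 0.243 * span_len
                + 0.0002 * opening - 0.0002 * stiffness - 0.9072)
    if storeys > 6.5 and opening <= 62.5:                        # Rule 2
        return (0.0532 * storeys + 0.0016 * span_len
                + 0.0108 * opening - 0.0233 * stiffness + 0.1191)
    if storeys > 6.5 and storeys > 10.5:                         # Rule 3
        return 0.1259 * storeys + 0.1856 * span_len + 0 * opening - 1.0226
    if storeys > 6.5:                                            # Rule 4
        return 0.1222 * storeys + 0.113 * span_len + 0.0001 * opening - 0.6161
    if storeys > 3.5 and opening > 37.5:                         # Rule 5
        return 0.1229 * storeys + 0.061 * span_len + 0.0002 * opening - 0.3746
    if storeys > 1.5 and opening <= 62.5:                        # Rule 6
        return 0.0481 * storeys + 0.0008 * span_len + 0.0026 * opening + 0.0041
    if storeys > 1.5:                                            # Rule 7
        return 0.0071 * storeys + 0.0219 * span_len + 0.2009
    return 0.1423                                                # Rule 8 (default)
```

If this function and the Excel column agree row by row, the formula conversion itself is probably not the problem.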

When I evaluate the predicted values, they are way off, as shown in the
picture. The points should be close to the 45-degree line:

<https://weka.8497.n7.nabble.com/file/t6958/1.png>

I am completely baffled, and I suspect it is something obvious, but I cannot
find it. The same occurs with properly trained models and many rules. Any
ideas?

TIA




--
Sent from: https://weka.8497.n7.nabble.com/
_______________________________________________
Wekalist mailing list -- [hidden email]
Send posts to: To unsubscribe send an email to [hidden email]
To subscribe, unsubscribe, etc., visit
https://list.waikato.ac.nz/postorius/lists/wekalist.list.waikato.ac.nz
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: Possible error with M5Rules?

Eibe Frank-3
Have you computed the correlation coefficient, etc., based on predictions in Excel? Unfortunately, those summary statistics often don’t tell the whole story and the corresponding scatter plots can be surprising.

Cheers,
Eibe

On Sun, 20 Oct 2019 at 1:08 PM, achar <[hidden email]> wrote:

Re: Possible error with M5Rules?

achar
Yes I have, but they are nowhere near what I get with WEKA. The predictions
are obviously wrong, even with properly trained models and many rules. I
feel something is amiss. The formulas in Excel are nested, and they are
evaluated in the same order as in the rule output.

I have built simple nonlinear models with Mathematica on the same data.
After playing around with the data, to find out which columns are important,
what model to use, etc, I came up with formulas that are quite small
(one-half liners), and I get very good results (as in the attached picture).

<https://weka.8497.n7.nabble.com/file/t6958/1.png>

I was hoping for something better with ML. I think I will try a tree, maybe
M5P, and see if I am lucky.

TIA






Re: Possible error with M5Rules?

Eibe Frank-2
Administrator
Is it possible that the reason is simply insufficient precision in the output of the coefficients that are included in the M5Rules model? For example, in one of the linear models in your output, a coefficient has been rounded to zero!

I have just committed changes to M5P and M5Rules so that they now respect the number of decimal places specified by -num-decimal-places when these classifiers are configured by the user. These changes are available in the nightly snapshots available from


(but it seems I have just missed the deadline for tonight's 3.8 build process so please use the developer-branch version, which is currently still pretty much the same anyway).

Please try the new version out with an increased precision in the output and let us know if this was the reason.
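
To get a feel for how much rounded coefficients can matter, here is a rough worst-case bound, sketched in Python. Each printed coefficient (and the intercept) can be off by at most half a unit in the last printed digit, so the prediction error from rounding is bounded by that half-unit times the sum of the largest attribute magnitudes, plus one for the intercept. The attribute ranges below are purely hypothetical placeholders; the real ranges of the dataset are not given in this thread.

```python
def max_rounding_error(decimals, attribute_ranges):
    """Upper bound on the prediction error caused by rounding the
    coefficients and intercept of one linear model to `decimals` places.

    attribute_ranges: list of (low, high) pairs, one per attribute.
    """
    half_unit = 0.5 * 10 ** (-decimals)          # max error per coefficient
    largest = sum(max(abs(lo), abs(hi)) for lo, hi in attribute_ranges)
    return half_unit * (1 + largest)             # "+ 1" covers the intercept


# Hypothetical ranges (assumptions, NOT taken from the actual data):
assumed_ranges = [
    (1, 20),     # Number of Storeys
    (1, 6),      # Number of Spans
    (3, 7.5),    # Length of Spans (m)
    (0, 100),    # Opening percentage (%)
    (0, 25),     # Masonry wall Stiffeness Et (x10^5 kN/m)
]
bound = max_rounding_error(4, assumed_ranges)
print(f"worst-case rounding error at 4 decimals: {bound:.6f}")
```

With these assumed ranges the bound stays below 0.01, which is small next to the reported mean absolute error of 0.0824. Whether that holds for the real data depends on the actual attribute ranges.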

Cheers,
Eibe

On Sun, Oct 20, 2019 at 8:42 PM achar <[hidden email]> wrote:

Re: Possible error with M5Rules?

achar
I made a parser for M5P trees and got the same bad result (not to mention
that the formulas cannot fit in the cells of Excel).

In frustration, I finally found what was wrong: my data was "sorted", which
apparently messes up the model building in WEKA (I suspect the folds, as I
assume that WEKA does not perform stratified cross-validation).

I randomized the order of rows in Mathematica, created a new ARFF file, and
now I get good results with both M5Rules and M5P.
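
For anyone without Mathematica, the row shuffling can also be done with a few lines of Python. This is only a sketch under simplifying assumptions: the file is a dense ARFF with one instance per line after the `@data` marker, and comment lines inside the data section can be dropped. The function name and the seed are arbitrary choices for illustration.

```python
import random


def shuffle_arff_rows(arff_text, seed=42):
    """Return the ARFF text with its data rows in random order.

    Assumes one instance per line after the '@data' line. Blank lines and
    '%' comment lines in the data section are discarded.
    """
    lines = arff_text.splitlines()
    # Index of the first line after the '@data' marker (case-insensitive).
    data_start = next(
        i for i, line in enumerate(lines) if line.strip().lower() == "@data"
    ) + 1
    header = lines[:data_start]
    rows = [
        line for line in lines[data_start:]
        if line.strip() and not line.lstrip().startswith("%")
    ]
    random.Random(seed).shuffle(rows)   # seeded, so the result is reproducible
    return "\n".join(header + rows) + "\n"
```

Typical use would be to read the original file into a string, pass it through `shuffle_arff_rows`, and write the result to a new file before loading it in WEKA.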

Thank you.




Re: Possible error with M5Rules?

Peter Reutemann
> In frustration, I finally found what was wrong: my data was "sorted", which
> apparently messes up the model building in WEKA (I suspect the folds, as I
> assume that WEKA does not make stratified cross validation).

Weka performs (randomized) stratified cross-validation. But, of
course, cross-validation only produces statistics from X different
models and not a single model.
However, the final model, whose string representation is output (and
which you can save as well), is built on the dataset "as is". Hence
that model may not perform according to the results of the
cross-validation.
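
The distinction can be sketched in a few lines of Python. This is a toy illustration, not Weka's code: the "learner" just predicts the training mean, the folds are randomized but not stratified, and all names are made up. The point is only that the cross-validation statistic is pooled over k throwaway models, while the model that gets printed is a separate one built on all rows.

```python
import random


def fit_mean(ys):
    """Toy 'model': predicts the mean of its training targets."""
    return sum(ys) / len(ys)


def cross_validate(ys, k, seed=1):
    """Mean absolute error pooled over k models, one per fold."""
    idx = list(range(len(ys)))
    random.Random(seed).shuffle(idx)            # randomized fold assignment
    folds = [idx[i::k] for i in range(k)]
    abs_errors = []
    for test in folds:
        held_out = set(test)
        train = [ys[i] for i in idx if i not in held_out]
        model = fit_mean(train)                 # discarded after this fold
        abs_errors += [abs(ys[i] - model) for i in test]
    return sum(abs_errors) / len(abs_errors)


ys = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2]
cv_mae = cross_validate(ys, k=3)    # statistic from 3 different models
final_model = fit_mean(ys)          # the single model that would be output
```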

Cheers, Peter
--
Peter Reutemann
Dept. of Computer Science
University of Waikato, NZ
+64 (7) 858-5174
http://www.cms.waikato.ac.nz/~fracpete/
http://www.data-mining.co.nz/