Random forest regression

rik ghosh
Hi,
 I have two questions.
1) What criterion does RF use to split nodes during regression? For classification, for example, it is either Gini or information gain. For regression, is it MSE, MAE, or something else?
2) Is the correlation coefficient reported for cross-validated runs the mean of the values from all the runs?

_______________________________________________
Wekalist mailing list -- [hidden email]
Send posts to [hidden email]
To unsubscribe send an email to [hidden email]
To subscribe, unsubscribe, etc., visit https://list.waikato.ac.nz/postorius/lists/wekalist.list.waikato.ac.nz
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: Fwd: Random forest regression

Peter Reutemann
> 1) What criterion does RF use during regression in splitting the trees? For example in classification it is either gini or information gain
>  For regression is it MSE or MAE or something else?

Can't really comment on that. Have a look at the source code
(RandomForest uses RandomTree as base learner):
https://svn.cms.waikato.ac.nz/svn/weka/trunk/weka/src/main/java/weka/classifiers/trees/RandomTree.java

> 2) The coefficient of correlation outputted during cross validated runs are the mean values from all the runs?

In the Explorer or from the command-line, no. The statistics from the
folds get accumulated.
In the Experimenter, yes. The statistics from the various folds for
each run get used to calculate mean and stdev.

Cheers, Peter
--
Peter Reutemann
Dept. of Computer Science
University of Waikato, NZ
+64 (7) 858-5174
http://www.cms.waikato.ac.nz/~fracpete/
http://www.data-mining.co.nz/

Re: Fwd: Random forest regression

rik ghosh
Thanks for the response. 
I'd like a bit of clarification on the second question. What do you mean by "The statistics from the folds get accumulated"? How is the correlation coefficient reported in the Explorer calculated?

Cheers,
Rik.

Re: Fwd: Random forest regression

Peter Reutemann-3
The statistics from the X folds get treated as if they came from a single test set, hence accumulated.
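
To illustrate the difference between accumulating and averaging, here is a minimal Python sketch. It is not Weka code: the `pearson` helper and the three folds of (actual, predicted) values are made up for illustration. It pools the predictions from all folds into one test set and computes a single correlation, then compares that with the mean of the per-fold correlations.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Hypothetical (actual, predicted) values for three test folds.
folds = [
    ([1.0, 2.0, 3.0], [1.1, 2.1, 2.9]),
    ([10.0, 11.0, 12.0], [11.5, 11.0, 10.5]),  # predictions trend the wrong way
    ([20.0, 21.0, 22.0], [19.8, 21.2, 22.1]),
]

# Accumulated: treat the predictions from all folds as one test set.
all_actual = [a for actual, _ in folds for a in actual]
all_pred = [p for _, pred in folds for p in pred]
pooled = pearson(all_actual, all_pred)

# Averaged: one correlation per fold, then the mean of those values.
per_fold = [pearson(actual, pred) for actual, pred in folds]
averaged = sum(per_fold) / len(per_fold)

print("pooled:  ", pooled)
print("per fold:", per_fold)
print("averaged:", averaged)
```

With small folds, a single fold whose predictions happen to trend the wrong way can pull the average down sharply, while the pooled value is dominated by the overall spread of the target.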

Cheers, Peter

Re: Fwd: Random forest regression

rik ghosh
So how is this value different from averaging the correlations obtained over the X cross-validation sets? Is the entire training set run through the model as a test set?

Re: Fwd: Random forest regression

Peter Reutemann
> So how is this value different from averaging correlations gained over X cross validations sets? The entire training set is run through the model as a test set?

If you perform 10-fold cross-validation on the UCI dataset "bolts" and
accumulate the predictions to compute a single correlation
coefficient, you get something like this when using LinearRegression
(no attribute selection, elimination of collinear attributes turned
off):
0.9187366909027179

When computing the correlation coefficient per fold and then averaging
the values, you get this:
fold 1: 0.945944449233055
fold 2: -0.41160463785365725
fold 3: 0.9980728574483596
fold 4: 0.2063784891246402
fold 5: 0.995174801441272
fold 6: 0.7550608301703348
fold 7: 0.9610939612445496
fold 8: 0.9580342584781084
fold 9: 0.9786256633549173
fold 10: 0.9491100232883292
averaged: 0.7335890695929909

I've attached the ADAMS (https://adams.cms.waikato.ac.nz/) workflow
that generated that output.

Cheers, Peter

Attachment: cc.flow (5K)

Re: Fwd: Random forest regression

rik ghosh
Thanks. It's clear to me now.

Re: Random forest regression

Eibe Frank-2
Administrator
In reply to this post by rik ghosh
Regarding 1): info gain is used for classification and MSE (i.e., variance reduction) is used for regression (in RandomTree, which is what is used by RandomForest).
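
For readers unfamiliar with variance reduction as a split criterion, the following is a generic Python sketch for a single numeric attribute. It is not taken from RandomTree's source; the function names `sse` and `best_split` and the sample data are assumptions made for illustration. The idea is to pick the threshold that most reduces the summed squared error of the target around the mean on each side of the split.

```python
def sse(values):
    """Sum of squared deviations from the mean (n times the variance)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, reduction) for the split on one numeric
    attribute that gives the largest drop in squared error."""
    pairs = sorted(zip(xs, ys))
    total = sse([y for _, y in pairs])
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        # No threshold can separate identical attribute values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        reduction = total - (sse(left) + sse(right))
        if reduction > best[1]:
            # Use the midpoint between neighbouring values as threshold.
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (threshold, reduction)
    return best

# Hypothetical data: the target jumps between x = 3 and x = 10.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [5.0, 6.0, 5.5, 20.0, 21.0, 19.5]
threshold, reduction = best_split(xs, ys)
print("threshold:", threshold, "reduction:", reduction)
```

A random tree would apply a search like this only to a randomly chosen subset of attributes at each node; that part is omitted here.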

Cheers,
Eibe

Re: Random forest regression

rik ghosh
Thank you for your reply.

Cheers,
Rik
