Random forest accuracy

Random forest accuracy

Sehrish agha
Hello,
Random forest is giving 100 percent accuracy. I have 7 attributes. Please tell me whether anything is wrong. I have conducted 10 tests, and all of them have 100 percent accuracy.

_______________________________________________
Wekalist mailing list -- [hidden email]
Send posts to: To unsubscribe send an email to [hidden email]
To subscribe, unsubscribe, etc., visit
https://list.waikato.ac.nz/postorius/lists/wekalist.list.waikato.ac.nz
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: Random forest accuracy

Peter Reutemann
> Random forest is giving 100 percent accuracy. I have 7 attributes. Please tell me is there anything wrong . I have conducted 10 tests. All have 100 percent accuracy.

How are you testing? Cross-validation?
Do you have an ID attribute in your dataset that RandomForest
can associate with the class attribute?

Cheers, Peter
--
Peter Reutemann
Dept. of Computer Science
University of Waikato, NZ
+64 (7) 858-5174
http://www.cms.waikato.ac.nz/~fracpete/
http://www.data-mining.co.nz/

Re: Random forest accuracy

Sehrish agha
I am doing training only. I did cross-validation, but it was also 100 percent. And no, I don't have an ID attribute. Is an ID important for classification?

Re: Random forest accuracy

Peter Reutemann
> I am doing training only.

Do you mean you train and then evaluate against the training data?
That can easily report 100% and should never be used.
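
To illustrate why evaluating on the training data says little, here is a minimal Python sketch (not Weka code; the data and the lookup-table "classifier" are made up for illustration). A model that merely memorizes its training rows scores 100% on them even when the labels are pure noise, while unseen rows expose it.

```python
import random

def train_lookup(rows):
    """Memorize every (features -> label) pair seen during training."""
    return {features: label for features, label in rows}

def accuracy(model, rows, default):
    """Fraction of rows whose memorized (or default) label matches the true label."""
    hits = sum(1 for features, label in rows if model.get(features, default) == label)
    return hits / len(rows)

rng = random.Random(0)
# Hypothetical data: the features carry no signal, the labels are coin flips.
data = [((i, rng.random()), rng.choice("AB")) for i in range(200)]
train, test = data[:100], data[100:]

model = train_lookup(train)
train_acc = accuracy(model, train, default="A")  # evaluated on the rows it memorized
test_acc = accuracy(model, test, default="A")    # evaluated on rows it has never seen

print("train accuracy:", train_acc)
print("test accuracy: ", test_acc)
```

The training accuracy is perfect by construction; only the held-out number reflects what the model would do on new data.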

> I did cross validation but it was also 100 percent.

Unusual. Do you maybe have duplicate rows in there that could end up
in both train and test, allowing the model to obtain a perfect score (by
cheating)?
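
A quick way to check for this outside of Weka is to intersect the two sets of rows. The following is only a sketch in plain Python; the rows and column layout are hypothetical, not taken from the actual dataset.

```python
def shared_rows(train, test):
    """Return the rows that occur in both the train and the test set."""
    return set(map(tuple, train)) & set(map(tuple, test))

# Hypothetical rows: (source, protocol, port, class label)
train = [("10.0.0.1", "TCP", 80, "A"), ("10.0.0.2", "UDP", 53, "F")]
test = [("10.0.0.1", "TCP", 80, "A"), ("10.0.0.3", "TCP", 443, "A")]

dupes = shared_rows(train, test)
print(len(dupes), "distinct row(s) appear in both train and test")
```

If a large share of the test rows shows up here, the test result mostly measures memorization.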

> And no i dont have id attribute. Is ID important for classification?

No, but IDs can be helpful for tracking predictions. However, ID
attributes can leak information into the model.
For instance, if you add an ID attribute to the iris UCI dataset, which
is sorted by class label, then a classifier can learn that IDs up to
50 belong to the first class label, IDs from 51 to 100 to the second
class label, and all others to the third class label.
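
The iris example can be sketched in a few lines of Python. This is a stand-in rather than the real data file: it only assumes 150 rows sorted by class, 50 per class, with an ID counting up from 1.

```python
# Stand-in for the sorted iris data: 50 rows per class, in class order.
labels = ["Iris-setosa"] * 50 + ["Iris-versicolor"] * 50 + ["Iris-virginica"] * 50
rows = [(row_id, label) for row_id, label in enumerate(labels, start=1)]

def predict_from_id(row_id):
    """A 'classifier' that looks at nothing but the ID attribute."""
    if row_id <= 50:
        return "Iris-setosa"
    if row_id <= 100:
        return "Iris-versicolor"
    return "Iris-virginica"

accuracy = sum(predict_from_id(row_id) == label for row_id, label in rows) / len(rows)
print(accuracy)
```

The ID alone gives a perfect score, although it says nothing about the flowers themselves.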

Cheers, Peter

Re: Random forest accuracy

Sehrish agha
>> I am doing training only.
>
> Do you mean you train and then evaluate against the training data?
> That can easily report 100% and should never be used.

If that is not correct, is there any other step after training and before testing?

>> I did cross validation but it was also 100 percent.
>
> Unusual. Do you maybe have duplicate rows in there which could end up
> in train/test, allowing the model to obtain a perfect model (by
> cheating).

I take a dataset and then divide it into a test set and a train set. There are some duplicate values for some attributes. I am sharing my dataset; please tell me what is wrong with it.

Attachment: test and train.zip (959K)

Re: Random forest accuracy

Peter Reutemann
>     - I take a dataset and then i divide it into test and train set. there are some duplicate values for some attributes. i am sharing my dataset, please tell me whats wrong in it.

These two datasets have differing values in
Source/Destination/Protocol/Source port/Dest port.

I'm not sure how you're loading these datasets but CSV files can be problematic.
For nominal and string attributes, Weka uses numeric indices
internally. If your dataset definitions differ between train and test
(different order or different subset of values in nominal/string
attribute), then the internal index 0 can mean two different things
for train and test. However, the built model will assume the meaning
from the training operation.

I recommend using ARFF files instead, which define the values for
nominal attributes in the header section. Splitting such a dataset and
then saving the parts once again as ARFF won't expose you to these
problems.
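
The index problem can be shown with a small Python sketch. It is a simplification and not Weka's loader code: it assumes a loader that numbers nominal values in order of first appearance, and the protocol values are made up.

```python
def build_index(values):
    """Number values in order of first appearance (assumed loader behaviour)."""
    index = {}
    for value in values:
        index.setdefault(value, len(index))
    return index

# Hypothetical nominal column as it appears in two separately loaded files.
train_protocols = ["TCP", "UDP", "TCP", "ICMP"]
test_protocols = ["UDP", "TCP", "UDP"]

train_index = build_index(train_protocols)  # TCP=0, UDP=1, ICMP=2
test_index = build_index(test_protocols)    # UDP=0, TCP=1

# The model interprets test indices with the meaning learned from train.
train_names = {i: value for value, i in train_index.items()}
misread = [train_names[test_index[value]] for value in test_protocols]
print("model sees:", misread, "instead of", test_protocols)

# With one shared value list (as an ARFF header declares), nothing is misread.
header = ["TCP", "UDP", "ICMP"]
shared = {value: i for i, value in enumerate(header)}
decoded = [header[shared[value]] for value in test_protocols]
print("with shared header:", decoded)
```

With separately built indices, every UDP row in the test file is read as TCP and vice versa; with a shared header the values survive unchanged.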

Cheers, Peter

Re: Random forest accuracy

Sehrish agha
I loaded these files in .arff format, but the accuracy is still 100%. I just want to know whether there is a problem in the dataset that causes the 100% accuracy.

Re: Random forest accuracy

Peter Reutemann-3
On November 14, 2019 5:51:37 PM GMT+13:00, sehrish Agha <[hidden email]> wrote:

> I load these files in .arff format. But still accuracy is 100%. I just want
> to know is there any problem in dataset due to which accuracy is 100%?

Could it be the same kind of problem, namely that your time attribute can be used to accurately separate the two classes? Remove it and try building the model again.

Cheers, Peter

Re: Random forest accuracy

Sehrish agha
No, I have changed the values of the time attribute. Now they are consistent, as you can see in my data files. Other attribute values are similar for some instances. Is it due to that?

Re: Random forest accuracy

Peter Reutemann
> No,  i have changed values for time attribute. Now they are consistent. You can see in my data files. Other attribute values are similar for some instances. Is it due to that?

Your data is easy to separate, just by using the "Source" attribute.

For the test data the following holds true:
Source = "172.16.0.1" -> F
everything else -> A

I combined your train/test data into a single CSV file, loaded that in
and built a J48 model using a train/test split of 66.667% with no
randomization.
As you can see, the only time that the label "A" needs to be
predicted is when "Source = 172.16.0.1".

Visualizing the attributes with respect to the class label, as the
Explorer does in the Preprocess tab (or Visualize tab), allows you to
spot that as well.
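
The same check can be done programmatically. Below is a rough Python sketch, not Weka functionality: it lists attributes for which every distinct value maps to exactly one class label. The rows are invented and merely mirror the rule quoted above for the test data.

```python
from collections import defaultdict

def purely_separating_attributes(rows, attribute_names, class_index=-1):
    """Return attributes where each distinct value occurs with only one class label."""
    result = []
    for col, name in enumerate(attribute_names):
        labels_by_value = defaultdict(set)
        for row in rows:
            labels_by_value[row[col]].add(row[class_index])
        if all(len(labels) == 1 for labels in labels_by_value.values()):
            result.append(name)
    return result

# Hypothetical rows: (Source, Protocol, class label)
rows = [
    ("172.16.0.1", "TCP", "F"),
    ("172.16.0.1", "UDP", "F"),
    ("192.168.10.50", "TCP", "A"),
    ("192.168.10.51", "UDP", "A"),
    ("192.168.10.50", "UDP", "A"),
]
separating = purely_separating_attributes(rows, ["Source", "Protocol"])
print(separating)
```

Note that an attribute with all-unique values (such as an ID) passes this check trivially, so a hit is a reason to look closer rather than proof of a leak.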

Cheers, Peter

Attachment: j48.txt (2K)

Re: Random forest accuracy

Sehrish agha
Okay, I understand the reason now.
Thanks a lot for your help.
