Difficulty in preprocessing attribute values.


Difficulty in preprocessing attribute values.

Sehrish agha
Hello,
I am having difficulty preprocessing my dataset. Details of the dataset are
given below:

1. I have 2 classes in my dataset, and one of the attributes in both classes
is 'Time'.

2. The 'Time' attribute has completely different values for the two classes:
for class 1, time < 500; for class 2, time > 500.

I am using the Random Forest algorithm. It treats 'Time' as an important
attribute and reports an accuracy of 100%, whereas I have other important
attributes as well (Source IP, Destination IP, Length, Protocol) which,
I think, should be considered first.

Please tell me how to preprocess the 'Time' attribute values so that my
classifier does a fair classification.



--
Sent from: https://weka.8497.n7.nabble.com/
_______________________________________________
Wekalist mailing list -- [hidden email]
Send posts to: To unsubscribe send an email to [hidden email]
To subscribe, unsubscribe, etc., visit
https://list.waikato.ac.nz/postorius/lists/wekalist.list.waikato.ac.nz
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: Difficulty in preprocessing attribute values.

Ted Cary
Hello,

You can just remove the 'Time' attribute if you don't want the classifier (Random Forest) to use it. In Explorer, click the checkbox next to it and then hit the 'Remove' button. If, as you say, the 'Time' attribute has completely different values for the two classes, any good classifier will generate a model with accuracy near 100% if you include 'Time' in your training.

Are you absolutely sure that the model is not fair, though? If 'Time' is fair input, then it is fair to separate the classes by just finding a threshold on 'Time' -- is it possible you just have an easy problem? The model should be fair if the attributes are "fair" and as long as you are cross-validating the model correctly by training and testing on different (holdout) cases. If you *don't* want to use 'Time', though, then just remove it and train the model on the remaining attributes.
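
To make the "threshold on 'Time'" point concrete, here is a minimal sketch in plain Python (not Weka). The data is invented for illustration: class 1 times are drawn from 0..300 and class 2 times from 700..1000, and the names `best_threshold` and `accuracy` are hypothetical helpers, not part of any library. It fits a one-attribute threshold rule on a training split and scores it on a holdout split.

```python
import random

def accuracy(rows, threshold):
    """Fraction of rows where the rule 'time > threshold => class 2' is right."""
    hits = sum((2 if t > threshold else 1) == label for t, label in rows)
    return hits / len(rows)

def best_threshold(rows):
    """Try midpoints between consecutive distinct times; keep the most accurate."""
    values = sorted({t for t, _ in rows})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_thr, best_acc = None, -1.0
    for thr in candidates:
        acc = accuracy(rows, thr)
        if acc > best_acc:
            best_thr, best_acc = thr, acc
    return best_thr, best_acc

rng = random.Random(0)
# Hypothetical data: the two classes do not overlap on 'time' at all.
data = [(rng.uniform(0, 300), 1) for _ in range(50)]
data += [(rng.uniform(700, 1000), 2) for _ in range(50)]
rng.shuffle(data)
train, holdout = data[:70], data[70:]  # train and test on different cases

threshold, train_acc = best_threshold(train)
holdout_acc = accuracy(holdout, threshold)
print(f"train accuracy={train_acc:.2f}, holdout accuracy={holdout_acc:.2f}")
```

With classes this far apart, the learned threshold always falls in the gap between them, so both accuracies come out at 1.00. Whether that is "fair" still depends on whether 'Time' is a legitimate input for the study.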



Ted

(In reply to Sehrish agha's message of Mon, Oct 21, 2019 at 6:04 AM.)

Re: Difficulty in preprocessing attribute values.

Sehrish agha
Thanks a lot.
Actually, I have applied SMOTE to the minority class, due to which my 'Time' attribute values are replicated. So I don't have values greater than 100 in one class, and in the majority class they are greater than 700. Please tell me: is it better to remove 'Time', or can some other technique be applied to make the values of 'Time' consistent across both classes?

(In reply to Ted Cary's message of Tue, Oct 22, 2019, 2:37 AM.)

Re: Difficulty in preprocessing attribute values.

Ted Cary
First: Are you sure that Time is not simply a very predictive attribute that is sufficient for differentiating the classes perfectly?
Your study design will determine whether you want to use Time as an attribute or not.
If it's a valid attribute for your purposes, then you might just have easy data to model!

I'm not exactly clear on your problem, but answering the questions below might help.
For your data, did SMOTE create additional cases with new Time values all below 100, or did it actually exactly "replicate" old Time values?

SMOTE will create additional synthetic instances ("rows"), but it should not create new attributes ("columns"). 
Below is a toy example using time and distance to distinguish between WALKER and RUNNER.
The WALKER class is over-represented, so maybe SMOTE would generate a synthetic RUNNER to have more minority-class cases.
SMOTE doesn't generate new strings for the names, so it just generated a new synthetic runner also named Moe.
The new synthetic Moe is slightly slower than the old Moe, but he still ran a mile in 5 minutes.
There is a new Time *value*, and it's still no more than 5 minutes and far away from WALKER times.
It's a different value but still the same attribute.
There should still be only one Time column.
Both Moes are RUNNERs and both ran a mile in 5:00 or less, times much less than the WALKER times, which is expected.

Check if your data is formatted correctly for Weka -- are columns and rows somehow reversed?
If your time values were replicated exactly instead, is Time formatted as 'string' instead of 'numeric'?
Otherwise I suspect Time is just a very predictive attribute, just like it is in the toy example.
SMOTE creates synthetic instances, but if the times for the two classes were already very well separated before SMOTE, the new synthetic cases would also be separable by time.

ORIGINAL DATA
     | time |distance| CLASS
--------------------------
Moe  | 4:00 | 1 mile | RUNNER
Larry|15:00 | 1 mile | WALKER
Curly|40:00 | 2 miles| WALKER


AFTER SMOTE
     | time |distance| CLASS
--------------------------
Moe  | 4:00 | 1 mile | RUNNER
Moe  | 5:00 | 1 mile | RUNNER
Larry|15:00 | 1 mile | WALKER
Curly|40:00 | 2 miles| WALKER
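
The idea behind the toy tables can be sketched in a few lines of Python. This is only the interpolation step from the SMOTE algorithm, applied to one numeric attribute; it is not the code of Weka's SMOTE filter. The function name `smote_1d`, the parameter `k`, and the extra runner times are assumptions made up for the illustration (the table above has a single runner, which is not enough to interpolate between neighbours).

```python
import random

def smote_1d(minority_values, n_new, k=2, seed=0):
    """SMOTE-style oversampling sketch for one numeric attribute.

    Each synthetic value lies on the line between a minority value and
    one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority_values)
        others = list(minority_values)
        others.remove(base)  # drop one copy of the base value itself
        neighbours = sorted(others, key=lambda v: abs(v - base))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # uniform in [0, 1)
        synthetic.append(base + gap * (neighbour - base))
    return synthetic

# Hypothetical mile times in minutes
runner_times = [4.0, 4.5, 5.0]   # minority class
walker_times = [15.0, 40.0]      # majority class

new_runners = smote_1d(runner_times, n_new=5)
print(new_runners)

# Interpolating between identical values reproduces the same value.
repeated = smote_1d([7.0, 7.0, 7.0], n_new=3)
print(repeated)
```

Two things follow from the interpolation. Synthetic RUNNER times can never leave the range of the real RUNNER times, so they stay far from the WALKER times. And when a value and its chosen neighbour are equal, the synthetic value is an exact repeat, which is one possible way to see "replicated" values even for a numeric attribute.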

 


(In reply to Sehrish agha's message of Mon, Oct 21, 2019 at 9:46 PM.)

Re: Difficulty in preprocessing attribute values.

Sehrish agha
Replicated values are created for time, and time is numeric. Is there any way to create values that are not replicates of old values in the time attribute?

(In reply to Ted Cary's message of Fri, Nov 8, 2019, 12:14 AM.)

Re: Difficulty in preprocessing attribute values.

Ted Cary
That is strange -- SMOTE should create synthetic values for numeric attributes, not replicate discrete values. In your source data, how many unique numeric values are there for Time?

Try this: load your data in Weka Explorer. Select your TIME attribute from the list under the "Attributes" heading. Now look under the "Selected attribute" heading -- how many 'Unique' values are there compared to 'Distinct'? Now run SMOTE and do the same thing -- what are the new counts for 'Unique' and 'Distinct'?

(Having said all of this, if your time attribute values are distributed so that the values are very different for the two classes, where there is an obvious threshold at time=500 between the classes, almost *any* classifier will model this with 100% accuracy! For this you do not need Weka or Random Forest or machine learning. Your model could just be: is time greater than 500?) HTH
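
The same 'Distinct' and 'Unique' check can be approximated outside Weka. The sketch below assumes that 'Distinct' means the number of different values in the column and 'Unique' means the number of values that occur exactly once; that reading of the Explorer panel is an assumption here, and `distinct_and_unique` and the sample values are hypothetical.

```python
from collections import Counter

def distinct_and_unique(values):
    """Return (distinct, unique) counts for one attribute column."""
    counts = Counter(values)
    distinct = len(counts)                                # different values seen
    unique = sum(1 for c in counts.values() if c == 1)    # values seen only once
    return distinct, unique

# Hypothetical 'time' column with some repeated values
times = [3, 3, 7, 12, 12, 12, 40]
print(distinct_and_unique(times))  # (4, 2)
```

Running it on the column before and after oversampling shows whether new values were created (both counts rise) or old values were simply repeated (the instance count rises while 'Distinct' stays the same).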


Re: Difficulty in preprocessing attribute values.

Sehrish agha
The Unique and Distinct counts have increased after SMOTE, but replicated values are still observed in the minority class "time" attribute. I am sharing my dataset from before and after SMOTE. Please have a look and tell me if I am missing something.




(In reply to Ted Cary's message of Tue, Nov 12, 2019, 1:40 AM.)

Attachment: before after smote.zip (1M)

Re: Difficulty in preprocessing attribute values.

Ted Cary
Hi, labels A and F are perfectly separable by time -- that is the only model you need for these data. That is why every model you train has 100% accuracy. The problem is that this is not really a "problem" -- it can be solved on a spreadsheet; there is no need for machine learning.

Given this data, and assuming future instances are similarly distributed, it is very easy to know which cases are labeled A and which are labeled F -- anyone can look at any time and label a case as A or B. The only information any model has to learn is that all F labels have time less than 55, and all A labels have time greater than 7942 -- that is a huge separation.

Sure, the class imbalance is also huge, with 184 F-label cases and 50,0000 A-label cases. SMOTE is behaving reasonably given this input: any synthetic F cases it creates must still have time between 0 and 55.

Either you've solved your problem perfectly, or time is an unfair attribute to use for training a model. Only you can know if time is "fair" or not -- it depends on what you are trying to accomplish. However, knowing nothing else, if you are given these data and simply asked if you can predict A or F, assuming future data is generated similarly, then: yes, you can perfectly predict the class, and so can almost any classifier that you use in Weka.
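
A quick way to test this kind of separation for any numeric attribute is to compare the per-class value ranges. The sketch below is a spreadsheet-style check in Python, not a Weka feature; the rows, the `length` column, and the helper names `class_ranges` and `perfectly_separable` are invented for illustration, with 'time' values chosen to match the ranges described above.

```python
def class_ranges(rows, attribute, label_key="label"):
    """Return {label: (min, max)} of a numeric attribute for each class."""
    ranges = {}
    for row in rows:
        value, label = row[attribute], row[label_key]
        lo, hi = ranges.get(label, (value, value))
        ranges[label] = (min(lo, value), max(hi, value))
    return ranges

def perfectly_separable(rows, attribute, label_key="label"):
    """True if no two classes have overlapping ranges for the attribute."""
    spans = sorted(class_ranges(rows, attribute, label_key).values())
    return all(prev_hi < next_lo
               for (_, prev_hi), (next_lo, _) in zip(spans, spans[1:]))

# Hypothetical rows: F has time below 55, A has time above 7942.
rows = [
    {"time": 12, "length": 60, "label": "F"},
    {"time": 54, "length": 1500, "label": "F"},
    {"time": 7943, "length": 64, "label": "A"},
    {"time": 9000, "length": 1400, "label": "A"},
]
for attr in ("time", "length"):
    print(attr, perfectly_separable(rows, attr))
```

For these made-up rows, 'time' is reported as separable and 'length' is not. Any attribute that passes this check can, on its own, give a classifier 100% accuracy on the data at hand.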


Ted

(In reply to Sehrish agha's message of Tue, Nov 12, 2019 at 3:58 PM.)

Re: Difficulty in preprocessing attribute values.

Sehrish agha
OK, I got the point. Thanks a lot for your help.

(In reply to Ted Cary's message of Wed, Nov 13, 2019, 2:44 AM.)

Re: Difficulty in preprocessing attribute values.

Ted Cary
You're welcome! Correction: I used B above when I meant F, but I hope it was not too confusing. (I also think some of your other attribute columns can easily separate the two classes, not just the 'time' column.) You are using a sledgehammer to crack a nut. :)

(In reply to Sehrish agha's message of Wed, Nov 13, 2019 at 11:56 PM.)