outlier generation - network intrusion

outlier generation - network intrusion

Luisa
Hello,

I want to transform a binary ANN classifier into an outlier detector. I have plenty of experimental healthy network signals. I have seen some works that generate artificial outliers (intrusion samples) as uniformly distributed instances in the feature space. My questions are:

- Is this method reliable?
- How many outliers would be needed for the generated set to be representative? If I have high dimensionality, I would need a lot, right?
- The false positive rate would indicate the percentage of outliers that fall within the target class, right? If the artificial data is uniformly distributed, wouldn't that make my classifier overfit the real positive samples?

After I train my ANN, I test it on a new dataset. This dataset contains unseen healthy data and some outliers (but these are now real samples: 20% of the dataset consists of real intrusion samples). My questions are:

- If my classifier performs well in this test, could I conclude that the method worked?
- What would the false positive and false negative rate represent in this case?

Thanks.
Luisa.

_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
List info and subscription status: https://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: outlier generation - network intrusion

Eibe Frank

> On 4/05/2017, at 7:50 AM, Luisa <[hidden email]> wrote:
>
> I want to transform a binary ANN classifier into an outlier detector. I have plenty of experimental healthy network signals. I have seen some works that generate artificial outliers (intrusion samples) as uniformly distributed instances in the feature space. My questions are:
>
> - Is this method reliable?

Probably only for a small number of attributes.

> - How many outliers would be needed for the generated set to be representative? If I have high dimensionality, I would need a lot, right?

Yes, the amount of required data should increase exponentially with the number of attributes.
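
For intuition, here is a minimal sketch of the uniform generation scheme in plain Java (not a WEKA class; it assumes the per-attribute minima and maxima have already been computed from the healthy data):

------

import java.util.Random;

// A minimal sketch of the uniform artificial-outlier idea: each attribute
// is sampled independently and uniformly from its observed [min, max]
// range in the healthy (target-class) data.
public class UniformOutlierGenerator {

    // mins[j] and maxs[j]: observed range of attribute j in the healthy data.
    public static double[][] generate(int n, double[] mins, double[] maxs, long seed) {
        Random rng = new Random(seed);
        double[][] outliers = new double[n][mins.length];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < mins.length; j++) {
                outliers[i][j] = mins[j] + rng.nextDouble() * (maxs[j] - mins[j]);
            }
        }
        return outliers;
    }
}

------

To see why this scales badly: covering the space at a resolution of roughly k distinct values per attribute requires on the order of k^d points, so a resolution that takes 10^4 points in a 4-attribute space takes around 10^20 points in a 20-attribute space.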

You could try using the OneClassClassifier in WEKA instead, which does something a bit more sophisticated to reduce the amount of artificial data that needs to be generated. The details are described in a paper (the reference comes with OneClassClassifier).

Unless the outlier class in your data actually has the class label “outlier”, the output of OneClassClassifier is a bit unusual for WEKA. Here is an example, using it on the pima-indians diabetes data with an MLP that has five hidden units (the “tested_negative” class in the diabetes data is treated as the target class, i.e., the positive class, and “tested_positive” cases are treated as outliers):

------

java weka.Run .OneClassClassifier -W .MLPClassifier -t ~/datasets/UCI/diabetes.arff -tcl "tested_negative" -I -- -N 5



=== Stratified cross-validation ===

Correctly Classified Instances         444               57.8125 %
Incorrectly Classified Instances       194               25.2604 %
Kappa statistic                          0    
EER                                      0.5  
Quadratic Weighted Kappa                 0.0
Mean absolute error                      0.3041
Root mean squared error                  0.5514
Relative absolute error                 83.0046 %
Root relative squared error            130.8887 %
UnClassified Instances                 130               16.9271 %
Total Number of Instances              768    


=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 1.000    1.000    0.696      1.000    0.821      0.000    0.582     0.691     tested_negative
                 0.000    0.000    0.000      0.000    0.000      0.000    0.500     0.349     tested_positive
Weighted Avg.    0.696    0.696    0.484      0.696    0.571      0.000    0.557     0.587    


=== Confusion Matrix ===

   a   b   <-- classified as
 444   0 |   a = tested_negative
 194   0 |   b = tested_positive

------

The unusual bit in the output is that predicted outliers are counted as “UnClassified Instances” and not included in the confusion matrix, etc., so you will need to calculate the actual false positive rate and true positive rate manually. The diabetes data has 500 instances of the target class (“tested_negative”) and 268 outliers (“tested_positive”), so, given the above confusion matrix, FP rate = 194/268 = 72% and TP rate = 444/500 = 89%. Consequently, FN rate = 1 - TP rate = 11%. The FN rate is called “rejection rate” in OneClassClassifier, and you can adjust the target rejection rate using a parameter (it should perhaps really be called the *false* rejection rate, to make the meaning clearer).
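
For example, to aim for a 5% target rejection rate, the call would look something like this (a sketch: I believe the option is -trr, but check the classifier's help output to be sure):

java weka.Run .OneClassClassifier -trr 0.05 -W .MLPClassifier -t ~/datasets/UCI/diabetes.arff -tcl "tested_negative" -I -- -N 5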

It’s probably easier to simply rename “tested_positive” to “outlier” in the actual data file so that you don’t have to do these manual calculations.
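
One way to do the renaming without editing the ARFF file by hand is WEKA’s RenameNominalValues filter (a sketch: I have not double-checked the option names, and it assumes the class is the last attribute):

java weka.Run .RenameNominalValues -i diabetes.arff -o diabetes.modified.arff -R last -N "tested_positive:outlier"

With the renamed file, the output then looks like this: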

java weka.Run .OneClassClassifier -W .MLPClassifier -t diabetes.modified.arff -tcl "tested_negative" -I -- -N 5



=== Stratified cross-validation ===

Correctly Classified Instances         525               68.3594 %
Incorrectly Classified Instances       243               31.6406 %
Kappa statistic                          0.2068
EER                                      0.3881
Quadratic Weighted Kappa                 0.20682737751181524
Mean absolute error                      0.3258
Root mean squared error                  0.5354
Relative absolute error                 71.6878 %
Root relative squared error            112.3329 %
Total Number of Instances              768    


=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.898    0.716    0.700      0.898    0.787      0.233    0.673     0.780     tested_negative
                 0.284    0.102    0.598      0.284    0.385      0.233    0.673     0.515     outlier
Weighted Avg.    0.684    0.502    0.665      0.684    0.647      0.233    0.673     0.688    


=== Confusion Matrix ===

   a   b   <-- classified as
 449  51 |   a = tested_negative
 192  76 |   b = outlier

Note that the result is slightly different, probably because the data is processed differently.

> - The false positive rate would indicate the percentage of outliers that fall within the target class, right? If the artificial data is uniformly distributed, wouldn't that make my classifier overfit the real positive samples?

I agree, you’d have to be careful to avoid overfitting here: the model should be kept sufficiently simple so that it does not fit the boundary between the real samples and the artificial outliers too closely.
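
With MLPClassifier, for example, the number of hidden units (-N) and the ridge penalty on the weights (-R) control the complexity of the model. A sketch using the option names I remember (check the classifier’s documentation):

java weka.Run .OneClassClassifier -W .MLPClassifier -t ~/datasets/UCI/diabetes.arff -tcl "tested_negative" -I -- -N 2 -R 0.1

A smaller -N and a larger -R both make the decision boundary smoother, which reduces the chance of fitting the artificial outliers too closely.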

> After I train my ANN, I test it on a new dataset. This dataset contains unseen healthy data and some outliers (but these are now real samples: 20% of the dataset consists of real intrusion samples). My questions are:
>
> - If my classifier performs well in this test, could I conclude that the method worked?

Yes, given that you have ground truth for testing, but the size of the test set is obviously also important.
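
As a rough guide, if each test instance is treated as an independent Bernoulli trial, the standard error of an estimated rate p from n instances is about sqrt(p(1-p)/n). With, say, 1,000 real test outliers and a true false negative rate of 10%, that gives sqrt(0.1 * 0.9 / 1000) ≈ 0.0095, i.e., about one percentage point of uncertainty.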

> - What would the false positive and false negative rate represent in this case?

The false positive rate would be the proportion of outliers classified as normal traffic, and the false negative rate the proportion of normal traffic classified as outliers, respectively.

Cheers,
Eibe


Re: outlier generation - network intrusion

Luisa
Eibe, thank you for your answer!

I was generating the same number of outliers as target-class instances, for a 20-attribute space (scenario 1) and a 4-attribute space (scenario 2).
After the training, I was testing on a dataset with 80% target and 20% outliers.

Let's suppose that the number of outliers generated in training suffices.

Then, during testing, would the class imbalance influence the classification?

In absolute terms, I have more than 2,000 real outlier instances, which is fewer than the 11,000 target-class instances, but still a large number. So I am not sure whether the class distribution of the test set will influence the classification.

If it does, should I oversample (e.g., with SMOTE) or undersample?

I thought that this data imbalance would be important in the training phase and not in the testing phase.

Cheers,



Re: outlier generation - network intrusion

Eibe Frank
No, don’t modify the test data. Test instances are processed independently from each other by all standard WEKA classifiers, so the class distribution of the test set cannot change the individual predictions; it only affects aggregate estimates such as precision.
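
If you want to address imbalance in the training data specifically, one option is to wrap the base learner in FilteredClassifier with the SMOTE filter (SMOTE comes as a separate WEKA package), so that oversampling is applied to the training data only and the test instances are never touched. A sketch, with a hypothetical data file name:

java weka.Run weka.classifiers.meta.FilteredClassifier -t data.arff -F weka.filters.supervised.instance.SMOTE -W weka.classifiers.functions.MLPClassifier -- -N 5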

Cheers,
Eibe


Re: outlier generation - network intrusion

Luisa
Thank you!
