Re: Wekalist Digest, Vol 171, Issue 23

Re: Wekalist Digest, Vol 171, Issue 23

micCalve
I have Weka 3.8.1 on Ubuntu, and there is no sign of weka.classifiers.meta.OneClassClassifier. What am I missing?

Sincerely,
Michael

On Wed, May 3, 2017 at 8:00 PM, <[hidden email]> wrote:
Send Wekalist mailing list submissions to
        [hidden email]

To subscribe or unsubscribe via the World Wide Web, visit
        https://list.waikato.ac.nz/mailman/listinfo/wekalist
or, via email, send a message with subject or body 'help' to
        [hidden email]

You can reach the person managing the list at
        [hidden email]

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Wekalist digest..."


Today's Topics:

   1. outlier generation - network intrusion (Luisa)
   2. Re: outlier generation - network intrusion (Eibe Frank)


----------------------------------------------------------------------

Message: 1
Date: Wed, 3 May 2017 16:50:16 -0300
From: Luisa <[hidden email]>
To: "Weka machine learning workbench list."
        <[hidden email]>
Subject: [Wekalist] outlier generation - network intrusion
Message-ID:
        <CACETB=zpsjeGfYcH4Vdump7x2=[hidden email]>
Content-Type: text/plain; charset="utf-8"

Hello,

I want to transform a binary ANN classifier into an outlier detector. I
have plenty of experimental healthy network signals. I have seen some works
that generate artificial outliers (intrusion samples) as uniformly
distributed instances in the feature space (a sketch of such a generator
follows my questions below). My questions are:

- Is this method reliable?
- What number of outliers would be enough to be representative? If I have
high dimensionality, I would need a lot, right?
- The false positive rate would indicate the percentage of outliers that
fall within the class region, right? If the artificial data is uniformly
distributed, wouldn't that make my classifier overfit the real positive
samples?
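
For concreteness, here is a minimal sketch of the kind of generator I mean,
using the WEKA Java API (the file name healthy.arff and the class value
"outlier" are placeholders; it assumes all predictor attributes are numeric
and the class attribute is the last one):

import java.util.Random;

import weka.core.DenseInstance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class UniformOutlierGenerator {
  public static void main(String[] args) throws Exception {
    // Load the healthy data; the nominal class attribute is assumed
    // to contain the values {normal, outlier}.
    Instances data = new DataSource("healthy.arff").getDataSet();
    data.setClassIndex(data.numAttributes() - 1);

    // Precompute the observed range of each predictor attribute.
    double[] mins = new double[data.numAttributes()];
    double[] maxs = new double[data.numAttributes()];
    for (int i = 0; i < data.numAttributes(); i++) {
      if (i != data.classIndex()) {
        mins[i] = data.attributeStats(i).numericStats.min;
        maxs[i] = data.attributeStats(i).numericStats.max;
      }
    }

    Random rng = new Random(1);
    int numOutliers = data.numInstances(); // one artificial outlier per real instance
    for (int n = 0; n < numOutliers; n++) {
      double[] vals = new double[data.numAttributes()];
      for (int i = 0; i < data.numAttributes(); i++) {
        if (i == data.classIndex()) {
          // Label the artificial instance as an outlier.
          vals[i] = data.classAttribute().indexOfValue("outlier");
        } else {
          // Sample uniformly within the observed range of the attribute.
          vals[i] = mins[i] + rng.nextDouble() * (maxs[i] - mins[i]);
        }
      }
      data.add(new DenseInstance(1.0, vals));
    }
    // 'data' now contains the healthy instances plus the artificial outliers.
  }
}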

After I train my ANN, I test it on a new dataset. This dataset has unseen
healthy data and some outliers (but now they are real samples: 20% of the
dataset consists of real intrusion samples). My questions are:

- If my classifier performs well on the test set, can I conclude that the
method worked?
- What would the false positive and false negative rate represent in this
case?

Thanks.
Luisa.

------------------------------

Message: 2
Date: Thu, 4 May 2017 11:27:51 +1200
From: Eibe Frank <[hidden email]>
To: "Weka machine learning workbench list."
        <[hidden email]>
Subject: Re: [Wekalist] outlier generation - network intrusion
Message-ID: <[hidden email]>
Content-Type: text/plain; charset=utf-8


> On 4/05/2017, at 7:50 AM, Luisa <[hidden email]> wrote:
>
> I want to transform a binary ANN classifier into an outlier detector. I have plenty of experimental healthy network signals. I have seen some works that generate artificial outliers (intrusion samples) as uniformly distributed instances in the feature space. My questions are:
>
> - Is this method reliable?

Probably only for a small number of attributes.

> - What number of outliers would be enough to be representative? If I have high dimensionality, I would need a lot, right?

Yes, the amount of required data should increase exponentially with the number of attributes: to maintain a fixed sampling density of, say, 10 points per dimension, you need on the order of 10^d uniform samples for d attributes.

You could try using the OneClassClassifier in WEKA instead, which does something a bit more sophisticated to reduce the amount of artificial data that needs to be generated. The details are described in a paper (the reference comes with OneClassClassifier).

Unless the outlier class in your data actually has the class label "outlier", the output of OneClassClassifier is a bit unusual for WEKA. Here is an example, using it on the Pima Indians diabetes data with an MLP that has five hidden units (the "tested_negative" class in the diabetes data is treated as the target class, i.e., the positive class, and "tested_positive" cases are treated as outliers):

------

java weka.Run .OneClassClassifier -W .MLPClassifier -t ~/datasets/UCI/diabetes.arff -tcl "tested_negative" -I -- -N 5

…

=== Stratified cross-validation ===

Correctly Classified Instances         444               57.8125 %
Incorrectly Classified Instances       194               25.2604 %
Kappa statistic                          0
EER                                      0.5
Quadratic Weighted Kappa                 0.0
Mean absolute error                      0.3041
Root mean squared error                  0.5514
Relative absolute error                 83.0046 %
Root relative squared error            130.8887 %
UnClassified Instances                 130               16.9271 %
Total Number of Instances              768


=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 1.000    1.000    0.696      1.000    0.821      0.000    0.582     0.691     tested_negative
                 0.000    0.000    0.000      0.000    0.000      0.000    0.500     0.349     tested_positive
Weighted Avg.    0.696    0.696    0.484      0.696    0.571      0.000    0.557     0.587


=== Confusion Matrix ===

   a   b   <-- classified as
 444   0 |   a = tested_negative
 194   0 |   b = tested_positive

------

The unusual bit in the output is that predicted outliers are counted as "UnClassified Instances" and not included in the confusion matrix, etc., so you will need to calculate the actual false positive rate and true positive rate manually. The diabetes data has 500 positive instances and 268 negative instances, so, given the above confusion matrix, FP rate = 194/268 = 72% and TP rate = 444/500 = 89%. Consequently, FN rate = 1 - TP rate = 11%. The FN rate is called "rejection rate" in OneClassClassifier and you can adjust the target rejection rate using a parameter (it should perhaps really be called *false* rejection rate, to make the meaning clearer).
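
Spelled out as code, the manual calculation looks like this (the counts are
taken directly from the confusion matrix and the known class sizes of the
diabetes data above):

public class ManualRates {
  public static void main(String[] args) {
    int numTargets = 500;        // tested_negative (target class) instances
    int numOutliers = 268;       // tested_positive (outlier) instances
    int targetsAccepted = 444;   // targets classified as the target class
    int outliersAccepted = 194;  // outliers classified as the target class

    double tpRate = (double) targetsAccepted / numTargets;   // 444/500 = 0.89
    double fpRate = (double) outliersAccepted / numOutliers; // 194/268 = 0.72
    double fnRate = 1.0 - tpRate;                            // "rejection rate" = 0.11

    System.out.printf("TP rate = %.2f, FP rate = %.2f, FN rate = %.2f%n",
        tpRate, fpRate, fnRate);
  }
}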

It's probably easier to simply rename "tested_positive" to "outlier" in the actual data file so that you don't have to do manual calculations. The output then looks like this:

java weka.Run .OneClassClassifier -W .MLPClassifier -t diabetes.modified.arff -tcl "tested_negative" -I -- -N 5

…

=== Stratified cross-validation ===

Correctly Classified Instances         525               68.3594 %
Incorrectly Classified Instances       243               31.6406 %
Kappa statistic                          0.2068
EER                                      0.3881
Quadratic Weighted Kappa                 0.20682737751181524
Mean absolute error                      0.3258
Root mean squared error                  0.5354
Relative absolute error                 71.6878 %
Root relative squared error            112.3329 %
Total Number of Instances              768


=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.898    0.716    0.700      0.898    0.787      0.233    0.673     0.780     tested_negative
                 0.284    0.102    0.598      0.284    0.385      0.233    0.673     0.515     outlier
Weighted Avg.    0.684    0.502    0.665      0.684    0.647      0.233    0.673     0.688


=== Confusion Matrix ===

   a   b   <-- classified as
 449  51 |   a = tested_negative
 192  76 |   b = outlier

Note that the result is slightly different, probably because the data is processed differently.
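
If you prefer to do the renaming programmatically instead of editing the
ARFF file by hand, here is a minimal sketch using the Java API (the file
names are placeholders):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSink;
import weka.core.converters.ConverterUtils.DataSource;

public class RenameToOutlier {
  public static void main(String[] args) throws Exception {
    Instances data = new DataSource("diabetes.arff").getDataSet();
    data.setClassIndex(data.numAttributes() - 1);
    // Rename the nominal class value "tested_positive" to "outlier".
    data.renameAttributeValue(data.classAttribute(), "tested_positive", "outlier");
    DataSink.write("diabetes.modified.arff", data);
  }
}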

> - The false positive rate would indicate the percentage of outliers that fall within the class region, right? If the artificial data is uniformly distributed, wouldn't that make my classifier overfit the real positive samples?

I agree, you'd have to be careful to avoid overfitting, e.g., by making sure that the model is sufficiently simple.

> After I train my ANN, I test it on a new dataset. This dataset has unseen healthy data and some outliers (but now they are real samples: 20% of the dataset consists of real intrusion samples). My questions are:
>
> - If my classifier performs well on the test set, can I conclude that the method worked?

Yes, given that you have ground truth for testing, but the size of the test set is obviously also important.

> - What would the false positive and false negative rate represent in this case?

How many outliers are classified as normal traffic and how many cases of normal traffic are classified as outliers, respectively.

Cheers,
Eibe



------------------------------

_______________________________________________
Wekalist mailing list
[hidden email]
https://list.waikato.ac.nz/mailman/listinfo/wekalist


End of Wekalist Digest, Vol 171, Issue 23
*****************************************


_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
List info and subscription status: https://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html
Re: Wekalist Digest, Vol 171, Issue 23

Peter Reutemann
> I have Weka 3.8.1 on Ubuntu, and there is no sign of
> weka.classifiers.meta.OneClassClassifier. What am I missing?

You might want to install the oneClassClassifier package. :-)

http://weka.wikispaces.com/Packages
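
For example, from the command line (this assumes Weka 3.8 and an internet
connection; the package can also be installed via the GUI package manager):

java weka.core.WekaPackageManager -install-package oneClassClassifier

After installation, weka.classifiers.meta.OneClassClassifier should be
available.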

Cheers, Peter
--
Peter Reutemann
Dept. of Computer Science
University of Waikato, NZ
+64 (7) 858-5174
http://www.cms.waikato.ac.nz/~fracpete/
http://www.data-mining.co.nz/
_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
List info and subscription status: https://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html