Strange errors at saving to .arff (versions 3.8 later)

bostjan.vouk
Hi,
I have a question about saving to the .arff format. I recently ran into a problem when saving a dataset to .arff (e.g. from an Instances object, with DataSink). When I save data to .arff through the API, every minus character is replaced by ? (instead of -2.3 I get ?2.3). The same happens in the GUI from version 3.8 onward: if I open a dataset that contains -2.3 and then save it (e.g. under a different name), I get a corrupted dataset (instead of -2.3 I get ?2.3). With the GUI and API of version 3.7 everything works normally, but I need a newer version because I use the RandomForest and FURIA algorithms.
I know I could fix these errors manually, but that is annoying and time-consuming, because I need such a file during a constructive induction process in which I evaluate features with the MDL measure. I take the MDL measure from R: I call Rscript through RCaller, and before the evaluation the dataset has to be saved to a file (e.g. .arff, .csv).

I hope that someone could help me or give me a clue to the solution.

Best regards,
Boštjan
_______________________________________________
Wekalist mailing list -- [hidden email]
Send posts to [hidden email]
To unsubscribe send an email to [hidden email]
To subscribe, unsubscribe, etc., visit https://list.waikato.ac.nz/postorius/lists/wekalist.list.waikato.ac.nz
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: Strange errors at saving to .arff (versions 3.8 later)

Peter Reutemann-3
Can you provide a minimal code example where this problem occurs?
Also, which Java version, operating system, etc. are you using?

Cheers, Peter

--
Peter Reutemann
Dept. of Computer Science
University of Waikato, NZ
+64 (7) 577-5304
http://www.cms.waikato.ac.nz/~fracpete/
http://www.data-mining.co.nz/

Re: Strange errors at saving to .arff (versions 3.8 later)

bostjan.vouk
Hi,
OS, Java, NetBeans, and Weka version:
Product Version: Apache NetBeans IDE 12.0
Java: 14.0.1; Java HotSpot(TM) 64-Bit Server VM 14.0.1+7
Runtime: Java(TM) SE Runtime Environment 14.0.1+7
System: Windows Server 2012 R2 version 6.3 running on amd64; Cp1250; en_US (nb)

Tested Weka versions: from 3.8.0 to 3.9.5

//test code
File folder = new File("datasets/realDatasets/class");
File[] listOfFiles = folder.listFiles();
for (File file : listOfFiles) {
    if (file.isFile()) {
        Instances data = new Instances(new BufferedReader(new FileReader(file)));
        data.setClassIndex(data.numAttributes() - 1);
        DataSink.write("testOutput.arff", data);
        ...

//whole method
public static void mdlCORElearn(Instances data) throws Exception {  //evaluation of the whole dataset
    File output = new File("Rdata/dataForR.arff");
    OutputStream out = new FileOutputStream(output);        
    DataSink.write(out, data);
    out.close();
   
    RCaller rCaller = RCaller.create();
    RCode code = RCode.create();
    code.addRCode("library(CORElearn)");
    code.addRCode("library(RWeka)");
    code.addRCode("dataset <- read.arff(\"Rdata/dataForR.arff\")");
    code.addRCode("estMDL <- attrEval(which(names(dataset) == names(dataset)[length(names(dataset))]), dataset, estimator=\"MDL\",outputNumericSplits=TRUE)");   //last attribute is class attribute
   
    rCaller.setRCode(code);
    rCaller.runAndReturnResultOnline("estMDL");
    String tmpRcall[]=rCaller.getParser().getAsStringArray("attrEval");   //name in R "attrEval", get data from R, evaluated attributes

    Map<String, Double> mapMDL=new TreeMap<String, Double>(Collections.reverseOrder());
    for(int i=0;i<data.numAttributes()-1;i++){
        mapMDL.put(data.attribute(i).name(),Double.parseDouble(tmpRcall[i]));   //we get attribute names from Java (Instances data) and evaluation from R
    }

    LinkedList<Map.Entry<String, Double>> listMDL = new LinkedList<>(mapMDL.entrySet());
    Comparator<Map.Entry<String, Double>> comparator2 = Comparator.comparing(Map.Entry::getValue);
    Collections.sort(listMDL, comparator2.reversed());
    for(Map.Entry<String, Double> me : listMDL){
        //System.out.printf(" %4.4f %s\n",me.getValue(), me.getKey());
            attrImpListMDL.printf(" %4.4f %s\n",me.getValue(), me.getKey());
    }
   
    rCaller.stopRCallerOnline();
    output.delete();//delete temp file
}

The error also occurs when using the Weka GUI:
OS Name: Microsoft Windows 10 Pro Education
OS Version: 10.0.18363 N/A Build 18363

Java(TM) SE Runtime Environment (build 1.8.0_241-b07)
Java HotSpot(TM) 64-Bit Server VM (build 25.241-b07, mixed mode)

If I use Weka version 3.7.7, everything works normally.

Best regards,
Boštjan

Re: Strange errors at saving to .arff (versions 3.8 later)

Peter Reutemann
So you're not actually using Weka to generate the data itself, but R.
I've never used a class called "RCaller". Is it from this library?
https://github.com/jbytecode/rcaller

Maybe this library has a problem with Weka's package manager?
Maybe a codepage problem with the string representation? UTF-8 vs
whatever Windows 10 uses?
Have you tried your code on Linux (which uses UTF-8 natively)?

Cheers, Peter

Re: Strange errors at saving to .arff (versions 3.8 later)

Michael Hall
In reply to this post by bostjan.vouk


On Feb 16, 2021, at 1:27 AM, [hidden email] wrote:

> Tested Weka versions: from 3.8.0 to 3.9.5

I tried to reproduce with 3.9.5

import weka.core.*;
import weka.core.converters.ConverterUtils.DataSink;
import java.io.*;
import java.util.ArrayList;

public class Test {

    /** Max number of decimal places for numeric values */
    static int m_MaxDecimalPlaces = AbstractInstance.s_numericAfterDecimalPoint;

    public static void main(String[] args) {
        Attribute attr = new Attribute("attr");
        ArrayList<Attribute> attrs = new ArrayList<>();
        attrs.add(attr);
        Instances test = new Instances("test", attrs, 0);
        DenseInstance di = new DenseInstance(1);
        di.setValue(attr, -2.3);
        test.add(di);
        // As near as I can tell, with an .arff file extension
        // DataSink will wrap an ArffSaver and output the instances like this
        System.out.println(di.toStringMaxDecimalDigits(m_MaxDecimalPlaces));
        // to duplicate your results
        try {
            DataSink.write("temp.arff", test);
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}

Getting...

cat temp.arff
@relation test

@attribute attr numeric

@data
-2.3

> I get spoiled dataset (instead of -2.3 I get ?2.3)

Seems fine for me. Maybe we need to see more of the arff. Good or bad. 



Re: Strange errors at saving to .arff (versions 3.8 later)

bostjan.vouk
Hi,
If I try your code, I get:
@relation test

@attribute attr numeric

@data
?2.3

Is it possible that I have some different or wrong setting on the server? I am only using Windows Server; I haven't tried the code on Linux.
RCaller works fine; the problem is only in the preprocessing step, where I generate and save the file with Java and the Weka object.

Best regards,
Boštjan

Re: Strange errors at saving to .arff (versions 3.8 later)

Michael Hall


Possibly.

What did the println show?
System.out.println(di.toStringMaxDecimalDigits(m_MaxDecimalPlaces));

Re: Strange errors at saving to .arff (versions 3.8 later)

bostjan.vouk
Hi,
println prints the same:
System.out.println(di.toStringMaxDecimalDigits(m_MaxDecimalPlaces));
?2.3

If I try something ordinary, such as
double tmp=-1.3;
System.out.println(tmp);
System.out.println(-3.4);

I get the right numbers:
-1.3
-3.4

The clue must be in the Instances object.
I get the same error if I use the Weka GUI. If I first load a file (arrhythmia.arff from UCI) in the Weka GUI and then save it under a different name, the same thing happens (all minus signs are replaced with ?).

Best regards,
Boštjan

Re: Strange errors at saving to .arff (versions 3.8 later)

Michael Hall


That is sort of strange, because the println in the test case goes through System.out and not through a Weka writer. So it almost has to suggest, as you say, that the value is already invalid in the Instance and that this is not an error in outputting.

Also sort of odd since - is simple ASCII.

And again sort of strange in that Weka is a very numeric application and no one else is indicating the error.

You could try something like this println in my test case

System.out.println((int)di.toString().charAt(0));

And see if it outputs the expected 45 for minus sign or something else.
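
The check above can be extended into a small standalone sketch. It assumes, purely for illustration, that the character in question might be the Unicode minus sign U+2212 rather than the ASCII hyphen-minus; nothing in the thread confirms this. The class name MinusCodePoint and both helper methods are hypothetical, and US-ASCII stands in for whatever code page the console or file writer really uses.

```java
import java.nio.charset.StandardCharsets;

public class MinusCodePoint {

    /** Returns the Unicode code point of the first character of s. */
    public static int firstCodePoint(String s) {
        return s.codePointAt(0);
    }

    /** Encodes s as US-ASCII and decodes it again; unmappable characters become '?'. */
    public static String throughAscii(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String ascii = "-2.3";          // ASCII hyphen-minus, code point 45
        String unicode = "\u22122.3";   // hypothetical: Unicode minus sign, code point 8722

        System.out.println(firstCodePoint(ascii));    // 45
        System.out.println(firstCodePoint(unicode));  // 8722

        // A character the target charset cannot represent is replaced by '?'
        System.out.println(throughAscii(ascii));      // -2.3
        System.out.println(throughAscii(unicode));    // ?2.3
    }
}
```

If the code printed for the Weka-formatted value is something other than 45, a character that the output encoding cannot represent would be one possible explanation for the ? in the saved file.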

Re: Strange errors at saving to .arff (versions 3.8 later)

Peter Reutemann
In reply to this post by bostjan.vouk
The toStringMaxDecimalDigits method, as implemented in
weka.core.AbstractInstance (e.g. weka.core.DenseInstance is derived from
that), calls weka.core.Utils.doubleToString in turn. The
doubleToString method uses Java's DecimalFormat class for generating
the actual string, which will take your locale into account. Older
versions of Weka had a locale-independent implementation.
I presume your default locale is generating these characters. Maybe
you can change the locale for the Java process?
https://stackoverflow.com/a/9894836/4698227

Cheers, Peter
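
A minimal sketch of the locale dependence described in this reply, using plain java.text.DecimalFormat outside of Weka. The class name LocaleMinusDemo, the helper format, and the pattern "0.0#####" are illustrative assumptions and not Weka's own code; whether changing the default locale is enough to fix the .arff output is not confirmed in the thread.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

public class LocaleMinusDemo {

    /** Formats value with a DecimalFormat that uses the symbols of the given locale. */
    public static String format(double value, Locale locale) {
        // Illustrative pattern (an assumption), not the one Weka uses internally
        DecimalFormat df = new DecimalFormat("0.0#####", DecimalFormatSymbols.getInstance(locale));
        return df.format(value);
    }

    public static void main(String[] args) {
        Locale def = Locale.getDefault();
        String formatted = format(-2.3, def);

        // Diagnostics: which locale is active and which character serves as the minus sign
        System.out.println("default locale:  " + def);
        System.out.println("minus sign code: " + (int) formatted.charAt(0));
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));

        // Possible workaround: force a fixed locale before any numbers are formatted
        Locale.setDefault(Locale.US);
        System.out.println(new DecimalFormat("0.0#####").format(-2.3)); // -2.3
    }
}
```

If the minus sign code under the default locale is not 45, calling Locale.setDefault(Locale.US) at program start, or launching the JVM with -Duser.language=en -Duser.country=US, may be worth trying before writing the dataset.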