Customized CSVLoader schema?

Robert Bates (Octobang)
Hi there!

I'm working with a rather large CSV file whose full schema I already know
(nominal values, types, etc.), and I would like to initialize a CSVLoader
with that schema instead of having it auto-scan. Converting to ARFF is not
an option because the file comes from an automated process, and not all
nominal values will appear in any given file.

Is there an idiomatic way to provide a custom schema to CSVLoader? I've
tried overriding the getStructure() method, but CSVLoader.readHeader(),
which CSVLoader.getStructure() calls, does a lot of under-the-hood
manipulation of instance properties. I keep getting all kinds of exceptions
from code that assumes those values were populated by readHeader()'s
auto-scan.

Thanks!

--

==========
Robert Bates
Octobang
[hidden email]
https://keybase.io/arpieb

_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
To subscribe, unsubscribe, etc., visit https://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html
Re: Customized CSVLoader schema?

Eibe Frank-2
Administrator
How about the Java code version of something like the following command:

  java weka.Run .CSVLoader weather.nominal.csv -B 1 -N first-last -L outlook:sunny,overcast,rainy -L temperature:hot,mild,cool -L humidity:high,normal -L windy:FALSE,TRUE -L play:no,yes

which can be run on the CSV version of weather.nominal.arff included in the data folder of the WEKA distribution. The CSV version can be created using

  java weka.Run .CSVSaver -i ~/weka-3-8-3/data/weather.nominal.arff -o weather.nominal.csv

If my understanding of the CSVLoader is correct, the command line above reads only the first row of data twice. The remaining rows are read once, because the buffer size is set to 1 (-B 1) and the header information is updated from the first batch only.

In Java code, you may even be able to avoid reading the first row of data twice by setting the buffer size to zero before calling getStructure() and setting it to some positive value afterwards.
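
As a rough sketch of what the Java side could look like: the helper below only assembles the option strings from the command line above out of a known schema, using plain Java. The class name CsvSchemaOptions and its build() method are made up for illustration; the WEKA calls mentioned in the trailing comment are an untested assumption about how the resulting array would be handed to the loader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of WEKA): turns a known schema into the
// -B / -N / -L options used in the command line above.
public class CsvSchemaOptions {

    // schema maps each nominal attribute name to its full list of labels,
    // in the order the labels should appear in the header.
    public static String[] build(Map<String, List<String>> schema,
                                 String nominalRange, int bufferSize) {
        List<String> opts = new ArrayList<>();
        opts.add("-B");
        opts.add(Integer.toString(bufferSize));
        opts.add("-N");
        opts.add(nominalRange);
        for (Map.Entry<String, List<String>> e : schema.entrySet()) {
            opts.add("-L");
            opts.add(e.getKey() + ":" + String.join(",", e.getValue()));
        }
        return opts.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the attributes in insertion order.
        Map<String, List<String>> schema = new LinkedHashMap<>();
        schema.put("outlook", Arrays.asList("sunny", "overcast", "rainy"));
        schema.put("temperature", Arrays.asList("hot", "mild", "cool"));
        schema.put("humidity", Arrays.asList("high", "normal"));
        schema.put("windy", Arrays.asList("FALSE", "TRUE"));
        schema.put("play", Arrays.asList("no", "yes"));

        String[] opts = build(schema, "first-last", 1);
        System.out.println(String.join(" ", opts));

        // Assumed usage with weka.jar on the classpath (not verified here):
        //   CSVLoader loader = new CSVLoader();
        //   loader.setOptions(opts);
        //   loader.setSource(new File("weather.nominal.csv"));
        //   Instances header = loader.getStructure();
        // The buffer-size idea above would mean building the options with
        // bufferSize 0 and raising it again after getStructure().
    }
}
```

Running main() prints the same option string as in the command above, starting at -B. If not every column is nominal, the "first-last" range would need to be narrowed to the nominal columns only.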

Cheers,
Eibe

> On 27/08/2019, at 1:53 AM, Robert Bates (Octobang) <[hidden email]> wrote:
>
> Hi there!
>
> I'm working with a rather large CSV file that I already know the full schema for (nominal values, types, etc) and would like to initialize a CSVLoader with the schema instead of having it auto-scan.  Converting to ARFF is not an option due to the fact this file is sourced from an automated process and not all nominal values will always appear in a given file.
>
> Is there an idiomatic way to provide a custom schema to CSVLoader?  I've tried overriding  the getStructure() method but it appears that there is a lot of under-the-hood instance property manipulation going on in the CSVLoader.readHeader() method called by CSVLoader.getStructure() and I keep getting all kinds of exceptions based on assumptions those values are populated by readHeader() auto-scanning.
>
> Thanks!