code contribution for reading gzipped arff files

Michael Dittenbach
hi!

in order to save disk space when working with many large data sets (yes, I
know that disk sizes are increasing ... but so are data set sizes ;-), we
offer a patch for reading gzipped arff files. maybe this idea will be
incorporated into future releases of weka.

two files have been patched: weka/gui/explorer/PreprocessPanel.java (used
when reading input files via the Explorer GUI) and
weka/core/converters/ArffLoader.java


diff for weka/gui/explorer/PreprocessPanel.java (weka version 3-4-4):
---- cut here ----
36a37
> import java.util.zip.GZIPInputStream;
157a159,162
>   protected ExtensionFileFilter m_gzArffFileFilter =
>     new ExtensionFileFilter(Instances.FILE_EXTENSION+".gz",
>                           "GZipped Arff data files");
>
249a255,256
>     m_FileChooser.
>         addChoosableFileFilter(m_gzArffFileFilter);
1035c1042,1043
<           if (f.getName().toLowerCase().endsWith(Instances.FILE_EXTENSION)) {
---
>           if ((f.getName().toLowerCase().endsWith(Instances.FILE_EXTENSION)) ||
>               (f.getName().toLowerCase().endsWith(Instances.FILE_EXTENSION+".gz"))) {
1037c1045,1052
<             Reader r = new BufferedReader(new FileReader(f));
---
>
>             Reader r = null;
>             try {
>                  r = new BufferedReader(new InputStreamReader(new GZIPInputStream(new FileInputStream(f))));
>             } catch (IOException e) {
>                  r = new BufferedReader(new FileReader(f));
>             }
>
---- cut here ----


diff for weka/core/converters/ArffLoader.java (weka version 3-4-4):
---- cut here ----
30a31
> import java.util.zip.GZIPInputStream;
154c155,159
<     m_sourceReader = new BufferedReader(new InputStreamReader(in));
---
>     try {
>        m_sourceReader = new BufferedReader(new InputStreamReader(new GZIPInputStream(in)));
>     } catch (IOException e) {
>        m_sourceReader = new BufferedReader(new InputStreamReader(in));
>     }
---- cut here ----


cheers
michael, helmut

--
michael dittenbach, helmut berger
e-commerce competence center - ec3
vienna, austria

_______________________________________________
Wekalist mailing list
[hidden email]
https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist

Re: code contribution for reading gzipped arff files

JRijnberk
Hi,
A short reaction:
Wonderful contribution. It will be appreciated by many.
Consider:
In our experience, zipped files are often read faster than plain files (I/O
is slow, while processor/memory operations are fast).

Hans





At 12:15 PM 02/06/2005 +0200, Michael Dittenbach wrote:

>[...]


Hans van Rijnberk


RE: code contribution for reading gzipped arff files

subrat
In reply to this post by Michael Dittenbach
Looks like a good idea. But will this then be able to read directly from zipped files in a format understood by Weka?

-----Original Message-----
From: [hidden email]
[mailto:[hidden email]]On Behalf Of Hans van
Rijnberk , Assort Vision, Utrecht
Sent: Thursday, June 02, 2005 3:57 PM
To: Michael Dittenbach; [hidden email]
Subject: Re: [Wekalist] code contribution for reading gzipped arff files


[...]

Re: code contribution for reading gzipped arff files

Eibe Frank
In reply to this post by Michael Dittenbach
Sounds like a good idea. However, when I apply your patch to
ArffLoader, loading standard (i.e. non-gzipped) ARFF files no longer
works. I guess the problem is that the same InputStream object is
re-used in the catch block, but its state has already been changed in
the try block.

Cheers,
Eibe
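
Eibe's guess can be checked outside of Weka. The sketch below is not taken from ArffLoader; the class and method names are made up for illustration. It feeds plain (non-gzipped) text to the GZIPInputStream constructor, lets it fail, and then reports how many bytes are still unread in the underlying stream. Since the constructor reads the gzip header eagerly, fewer bytes than were supplied should remain, so a fallback reader built on the same stream would start somewhere past the beginning of the data.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical demo class, not part of Weka.
class ConsumedStreamDemo {

  /** Returns how many bytes are still unread after a failed gzip attempt. */
  static int bytesLeftAfterFailedGzip(byte[] plainData) throws IOException {
    InputStream in = new ByteArrayInputStream(plainData);
    try {
      // The constructor reads the gzip header from 'in' right away.
      new GZIPInputStream(in);
    } catch (IOException e) {
      // Expected for non-gzip data. 'in' is still open, but the bytes
      // the constructor already read are not pushed back.
    }
    return in.available();
  }

  public static void main(String[] args) throws IOException {
    byte[] arff = "@relation demo\n@attribute x numeric\n@data\n1\n"
        .getBytes(StandardCharsets.US_ASCII);
    System.out.println("bytes supplied: " + arff.length);
    System.out.println("bytes left:     " + bytesLeftAfterFailedGzip(arff));
  }
}
```

Reopening the source (possible when a file name is known) or rewinding it with mark/reset avoids the problem; a stream that cannot be reopened needs the latter.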

On Jun 2, 2005, at 10:15 PM, Michael Dittenbach wrote:

> [...]



Re: code contribution for reading gzipped arff files

Michael Dittenbach
hi!

> Sounds like a good idea. However, when I apply your patch to ArffLoader,
> loading standard (i.e. non-gzipped) ARFF files no longer works. I guess the
> problem is that the same InputStream object is re-used in the catch statement
> but its state has changed in the try statement.

hmm ... strange, because it works for us. First, we try to open the file
via GZIPInputStream. An IOException is thrown if the file is not in GZIP
format; in this case we open it the usual way. If the whole code fragment
is nested in another try-catch block, exceptions such as
FileNotFoundException can be handled as well.

try {
   input = new BufferedReader(new InputStreamReader(
               new GZIPInputStream(new FileInputStream(infileName))));
} catch (IOException e) {
   input = new BufferedReader(new FileReader(infileName));
}

Our first thought was that ArffLoader is used throughout the code for
reading ARFF files, but that's not the case.

The above piece of code would have to be inserted in other places in the
weka code as well (e.g. weka.filters.Filter [3 times],
weka.classifiers.Evaluation [3 times]), because ARFF files are usually
opened by calling input = new BufferedReader(...) directly. Hence, a single
method encapsulating the opening of plain and gzipped ARFF files would
avoid duplicating this code, and callers would not have to care whether
the file is gzipped or not.

e.g.:

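static BufferedReader openArffFile(String fName) throws IOException {
   BufferedReader br;
   FileInputStream in = new FileInputStream(fName);
   try {
     // GZIPInputStream throws an IOException if the data is not in GZIP format
     br = new BufferedReader(new InputStreamReader(new GZIPInputStream(in)));
   } catch (IOException e) {
     in.close();  // release the first handle before reopening
     // not gzipped: reopen the file and read it as plain text
     br = new BufferedReader(new FileReader(fName));
   }
   return br;
}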


cheers
michael
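
The helper above needs a file name, because it opens the file a second time when the gzip attempt fails. Where only an InputStream is available (as in the ArffLoader diff), one possible alternative is to look at the first two bytes and rewind before deciding. The following is a sketch under that assumption, not Weka code: the class name and openMaybeGzipped are hypothetical, and it relies only on the gzip magic bytes 0x1f 0x8b and on BufferedInputStream's mark/reset.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, not part of Weka.
class GzipAwareOpen {

  /** Wraps 'in' in a reader, decompressing only if it starts with the gzip magic bytes. */
  static BufferedReader openMaybeGzipped(InputStream in) throws IOException {
    BufferedInputStream buffered = new BufferedInputStream(in);
    buffered.mark(2);            // remember the start of the stream
    int b1 = buffered.read();
    int b2 = buffered.read();
    buffered.reset();            // rewind so that no bytes are lost
    boolean gzipped = (b1 == 0x1f && b2 == 0x8b);
    InputStream source = gzipped ? new GZIPInputStream(buffered) : buffered;
    // Platform default charset, as in the original patch.
    return new BufferedReader(new InputStreamReader(source));
  }

  /** Test utility: gzip-compresses a byte array in memory. */
  static byte[] gzip(byte[] data) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
      gz.write(data);
    }
    return bos.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    byte[] plain = "@relation demo\n@data\n".getBytes(StandardCharsets.US_ASCII);
    // Both calls print the first line of the same ARFF text.
    System.out.println(openMaybeGzipped(new ByteArrayInputStream(plain)).readLine());
    System.out.println(openMaybeGzipped(new ByteArrayInputStream(gzip(plain))).readLine());
  }
}
```

Compared with the try/catch fallback, this keeps a single stream open and does not depend on how many bytes a failed GZIPInputStream constructor has consumed. A plain file that happens to start with the bytes 0x1f 0x8b would be misdetected, which seems unlikely for ARFF text.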

