[hdf-forum] h5py -- most efficient way to load a hdf5

Mag Gam magawake at gmail.com
Thu Jun 25 00:02:01 EDT 2009


Also, regarding

ds[offset:offset+100] = arr

I think that will load the entire set into memory, which could be costly...
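
On the memory concern: as far as I understand h5py, a slice assignment only needs the rows in arr to be in memory, so the part to keep bounded is the reading side. Below is a minimal sketch of reading the gzipped CSV in fixed-size groups, assuming the five-column layout from the sample in the quoted message. read_chunks and convert_row are hypothetical helper names, not part of h5py or NumPy.

```python
import csv
import gzip
from itertools import islice


def convert_row(row):
    """Turn one CSV row of strings into (str, str, int, int, float)."""
    color0, color1, num1, num2, num3 = row
    return (color0, color1, int(num1), int(num2), float(num3))


def read_chunks(path, chunk_size=100):
    """Yield lists of at most chunk_size converted rows from a gzipped CSV."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle)
        while True:
            # islice pulls at most chunk_size rows, so only one group
            # of rows is held in memory at a time.
            chunk = [convert_row(row) for row in islice(reader, chunk_size)]
            if not chunk:
                break
            yield chunk
```

Each yielded list could then be converted into an array and written to the matching slice of the dataset, advancing the offset by len(chunk) each time. The sketch assumes there are no blank or malformed lines in the file.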

On Wed, Jun 24, 2009 at 11:26 PM, Mag Gam<magawake at gmail.com> wrote:
> Andrew:
>
> I am having a terribly hard time loading a CSV file into an array in general.
>
> For example:
> zcat file.csv.gz
> red,blue,1,2,3
> green,orange,3,2,1
> blue,black,2,1,3
>
>
> I would like to put this into a 3x5 matrix with these types: string,
> string, int4, int4, float4.
>
>
> Any ideas? I looked through the examples, but none of them cover mixed
> data types or reading a file line by line.
>
> Here is what I have so far:
>
> mydtype = {'names': ('color0','color1','num1','num2','num3'),
>            'formats': ('S8','S8','i4','i4','f4')}
> ds = f.create_dataset("Foo", (5,), compression=1, dtype=mydtype)
>
> for s, row in enumerate(reader):
>     ds[s] = row  # Does not work
> f.close()
>
>
> Any ideas on how I can place this file into ds? In addition, I would
> like to use your optimization.
>
>
>
> On Wed, Jun 24, 2009 at 7:19 PM, Mag Gam<magawake at gmail.com> wrote:
>> Great. Thank you all for the advice.
>>
>> I will try this out.
>>
>>
>>
>> On Wed, Jun 24, 2009 at 2:49 PM, Andrew
>> Collette<andrew.collette at gmail.com> wrote:
>>> Hi,
>>>
>>>> Here is some code I have,
>>>
>>> There are a couple of ways you can speed this up.  First, resizing a
>>> dataset in HDF5 can be expensive, especially since you're doing it for
>>> each line you read.  You will have more success if you create a
>>> dataset with enough "rows" to begin with and then adjust the size as
>>> necessary:
>>>
>>> ds = myfile.create_dataset("ds", (NROWS,), mydtype, maxshape=(None,),
>>>                            compression='gzip')
>>>
>>> Second, you can try using a lower compression ratio, like
>>> "compression=1", and see if that helps.  You may even be able to avoid
>>> using compression altogether, since the HDF5 format is more efficient
>>> than CSV.
>>>
>>> Third (as Peter Alexander mentioned) is that it's almost certainly
>>> more efficient to read in your CSV in chunks.  I'm not personally
>>> familiar with the NumPy methods for reading in CSV data, but
>>> pseudocode for this would be:
>>>
>>> ds = myfile.create_dataset("ds", (NROWS,), mydtype, maxshape=(None,),
>>>                            compression=1)
>>>
>>> offset = 0
>>> for each group of 100 lines in the file:
>>>    arr = (use NumPy to load a (100,) chunk of data from file)
>>>    if offset + 100 > NROWS:
>>>        ds.resize(offset+100, axis=0)
>>>    ds[offset:offset+100] = arr
>>>    offset += 100
>>>
>>> ds.resize(<final row count>, axis=0)
>>> myfile.close()
>>>
>>> Last (although it won't affect performance), you can replace the pattern:
>>>
>>> try:
>>>  group=f.create_group(grp)
>>> except ValueError:
>>>  print "Day group already exists"
>>>
>>> with this:
>>>
>>> group = f.require_group(grp)
>>>
>>> Since you have your input data in a gzipped csv file, it would also be
>>> instructive simply to run:
>>> $ time gunzip -c myfile.csv.gz > /dev/null
>>> and see how much of your time is spent simply unzipping the input file. :)
>>>
>>> Andrew
>>>
>>> ----------------------------------------------------------------------
>>> This mailing list is for HDF software users discussion.
>>> To subscribe to this list, send a message to hdf-forum-subscribe at hdfgroup.org.
>>> To unsubscribe, send a message to hdf-forum-unsubscribe at hdfgroup.org.
>>>
>>>
>>
>
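
Putting the suggestions quoted above together, here is a rough sketch of writing rows into a resizable compound-dtype dataset in chunks. It is only an illustration under stated assumptions: the rows list stands in for the parsed CSV lines, CHUNK and the temporary file path are made-up values, and NROWS is deliberately too small so that the resize branch runs. The dataset is created with maxshape=(None,) so that resize() is permitted.

```python
import os
import tempfile

import h5py
import numpy as np

# Compound dtype: five field names matching five formats.
mydtype = np.dtype([
    ("color0", "S8"), ("color1", "S8"),
    ("num1", "i4"), ("num2", "i4"), ("num3", "f4"),
])

# Stand-in for the parsed CSV lines (assumption: already converted to tuples).
rows = [
    (b"red", b"blue", 1, 2, 3.0),
    (b"green", b"orange", 3, 2, 1.0),
    (b"blue", b"black", 2, 1, 3.0),
]

CHUNK = 2  # the thread uses 100; kept tiny here so the example is easy to follow
NROWS = 2  # initial guess at the row count, deliberately too small

path = os.path.join(tempfile.mkdtemp(), "foo.h5")  # hypothetical output file

with h5py.File(path, "w") as f:
    ds = f.create_dataset("Foo", (NROWS,), dtype=mydtype,
                          maxshape=(None,),  # unlimited along axis 0
                          compression="gzip", compression_opts=1)
    offset = 0
    for start in range(0, len(rows), CHUNK):
        arr = np.array(rows[start:start + CHUNK], dtype=mydtype)
        end = offset + len(arr)
        if end > ds.shape[0]:
            ds.resize(end, axis=0)  # grow only when the guess was too small
        ds[offset:end] = arr        # writes just this slice
        offset = end
    ds.resize(offset, axis=0)       # trim to the final row count

with h5py.File(path, "r") as f:
    data = f["Foo"][:]
```

Building each chunk as a structured array first is what makes the assignment work: a bare list of strings from the csv reader does not match the compound dtype, whereas tuples converted through np.array(..., dtype=mydtype) do.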
