[hdf-forum] h5py -- most efficient way to load a hdf5

Mag Gam magawake at gmail.com
Wed Jun 24 23:26:42 EDT 2009


Andrew:

I am having a terribly hard time loading a CSV file into an array in general.

For example:
zcat file.csv.gz
red,blue,1,2,3
green,orange,3,2,1
blue,black,2,1,3


I would like to put this into a 3x5 matrix with these column types:
"string, string, int4, int4, float4"


Any ideas? I looked through the examples, but there aren't any that cover
mixed data types and reading a file line by line.

Here is what I have so far:

mydtype = {'names': ('color0', 'color1', 'num1', 'num2', 'num3'),
           'formats': ('S8', 'S8', 'i4', 'i4', 'f4')}
ds = f.create_dataset("Foo", (5,), compression=1, dtype=mydtype)

for s, row in enumerate(reader):
    ds[s] = row  # Does not work
f.close()


Any ideas on how I can place this file into ds? In addition, I would
like to use your optimization.
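
For reference, one possible way to combine the compound dtype above with the chunked writing suggested in the quoted reply below is sketched here. It is only a sketch under stated assumptions: the helper names (parse_row, chunks, csv_gz_to_hdf5), the chunk size of 100 rows, and the doubling growth strategy are illustrative and do not come from the thread. The key point is that each CSV row arrives as a list of strings and has to be converted to a typed tuple before it can be stored in a compound dataset.

```python
import csv
import gzip
from itertools import islice

# Field layout from the post: two 8-byte strings, two int32, one float32.
# Note that 'S8' silently truncates strings longer than 8 bytes.
FIELDS = [("color0", "S8"), ("color1", "S8"),
          ("num1", "i4"), ("num2", "i4"), ("num3", "f4")]


def parse_row(row):
    """Convert one CSV row (a list of str) into a typed tuple."""
    c0, c1, n1, n2, n3 = row
    return (c0.encode("ascii"), c1.encode("ascii"), int(n1), int(n2), float(n3))


def chunks(rows, size):
    """Yield lists of at most `size` parsed rows from an iterable of raw rows."""
    it = iter(rows)
    while True:
        block = [parse_row(r) for r in islice(it, size)]
        if not block:
            return
        yield block


def csv_gz_to_hdf5(csv_path, h5_path, name="Foo", chunk_rows=100):
    """Stream a gzipped CSV into a 1-D compound dataset; return the row count."""
    # Imported here so the pure-Python helpers above work without these packages.
    import h5py
    import numpy as np

    dtype = np.dtype(FIELDS)
    with gzip.open(csv_path, "rt", newline="") as src, \
            h5py.File(h5_path, "w") as f:
        # maxshape=(None,) makes the dataset resizable along axis 0.
        ds = f.create_dataset(name, shape=(chunk_rows,), maxshape=(None,),
                              dtype=dtype, compression="gzip",
                              compression_opts=1)
        offset = 0
        for block in chunks(csv.reader(src), chunk_rows):
            end = offset + len(block)
            if end > ds.shape[0]:
                # Grow geometrically so resizes stay infrequent (an assumption).
                ds.resize((max(end, 2 * ds.shape[0]),))
            ds[offset:end] = np.array(block, dtype=dtype)
            offset = end
        ds.resize((offset,))  # trim to the rows actually written
    return offset
```

With the sample file above, a call such as csv_gz_to_hdf5("file.csv.gz", "out.h5") would be expected to leave a three-record dataset named "Foo" in out.h5 (the output file name is hypothetical). Converting a whole block to a NumPy structured array and assigning it in one slice keeps the number of HDF5 writes small, which is the point of the chunking advice.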



On Wed, Jun 24, 2009 at 7:19 PM, Mag Gam<magawake at gmail.com> wrote:
> Great. Thank you all for the advice.
>
> I will try this out.
>
>
>
> On Wed, Jun 24, 2009 at 2:49 PM, Andrew
> Collette<andrew.collette at gmail.com> wrote:
>> Hi,
>>
>>> Here is some code I have,
>>
>> There are a couple of ways you can speed this up.  First, resizing a
>> dataset in HDF5 can be expensive, especially since you're doing it for
>> each line you read.  You will have more success if you create a
>> dataset with enough "rows" to begin with and then adjust the size as
>> necessary:
>>
>> ds = myfile.create_dataset("ds", (NROWS,), mydtype, maxshape=(None,), compression='gzip')
>>
>> Second, you can try using a lower compression level, like
>> "compression=1", and see if that helps.  You may even be able to avoid
>> using compression altogether, since HDF5's binary storage is more
>> compact than CSV text.
>>
>> Third (as Peter Alexander mentioned) is that it's almost certainly
>> more efficient to read in your CSV in chunks.  I'm not personally
>> familiar with the NumPy methods for reading in CSV data, but
>> pseudocode for this would be:
>>
>> ds = myfile.create_dataset("ds", (NROWS,), mydtype, maxshape=(None,), compression=1)
>>
>> offset = 0
>> for each group of 100 lines in the file:
>>    arr = (use NumPy to load a (100,) chunk of data from file)
>>    if offset + 100 > NROWS:
>>        ds.resize(offset+100, axis=0)
>>    ds[offset:offset+100] = arr
>>    offset += 100
>>
>> ds.resize(<final row count>, axis=0)
>> myfile.close()
>>
>> Last (although it won't affect performance), you can replace the pattern:
>>
>> try:
>>  group=f.create_group(grp)
>> except ValueError:
>>  print "Day group already exists"
>>
>> with this:
>> group = f.require_group(grp)
>>
>> Since you have your input data in a gzipped csv file, it would also be
>> instructive simply to run:
>> $ time gunzip -c myfile.csv.gz > /dev/null
>> and see how much of your time is spent simply unzipping the input file. :)
>>
>> Andrew
>>
>> ----------------------------------------------------------------------
>> This mailing list is for HDF software users discussion.
>> To subscribe to this list, send a message to hdf-forum-subscribe at hdfgroup.org.
>> To unsubscribe, send a message to hdf-forum-unsubscribe at hdfgroup.org.
>>
>>
>
