[hdf-forum] h5py -- most efficient way to load a hdf5
Mag Gam
magawake at gmail.com
Wed Jun 24 19:19:55 EDT 2009
Great. Thank you all for the advice.
I will try this out.
On Wed, Jun 24, 2009 at 2:49 PM, Andrew Collette <andrew.collette at gmail.com> wrote:
> Hi,
>
>> Here is some code I have,
>
> There are a couple of ways you can speed this up. First, resizing a
> dataset in HDF5 can be expensive, especially since you're doing it for
> each line you read. You will have more success if you create a
> dataset with enough "rows" to begin with and then adjust the size as
> necessary:
>
> ds = myfile.create_dataset("ds", (NROWS,), mydtype, maxshape=(None,), compression='gzip')
>
> Second, you can try using a lower compression level, like
> "compression=1", and see if that helps. You may even be able to avoid
> using compression altogether, since the HDF5 format is more efficient
> than CSV.
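
For reference, here is a minimal sketch of how the compression choice can be spelled out explicitly. The helper name create_with_gzip_level and its arguments are my own for illustration, not part of h5py; it assumes a 1-D dataset that should stay resizable.

```python
import h5py
import numpy as np


def create_with_gzip_level(h5file, name, nrows, dtype, level=None):
    """Create a resizable 1-D dataset, optionally gzip-compressed.

    level=None means no compression; otherwise level is the gzip
    level (0-9), where lower levels trade file size for speed.
    """
    if level is None:
        return h5file.create_dataset(
            name, shape=(nrows,), dtype=dtype, maxshape=(None,)
        )
    return h5file.create_dataset(
        name,
        shape=(nrows,),
        dtype=dtype,
        maxshape=(None,),
        compression="gzip",
        compression_opts=level,
    )
```

Timing the same load with level=None, level=1 and the default gzip setting should show fairly quickly whether compression is the bottleneck.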
>
> Third (as Peter Alexander mentioned) is that it's almost certainly
> more efficient to read in your CSV in chunks. I'm not personally
> familiar with the NumPy methods for reading in CSV data, but
> pseudocode for this would be:
>
> ds = myfile.create_dataset("ds", (NROWS,), mydtype, maxshape=(None,), compression=1)
>
> offset = 0
> for each group of 100 lines in the file:
>     arr = (use NumPy to load a (100,) chunk of data from file)
>     if offset + 100 > NROWS:
>         ds.resize(offset+100, axis=0)
>     ds[offset:offset+100] = arr
>     offset += 100
>
> ds.resize(<final row count>, axis=0)
> myfile.close()
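
A runnable version of that pseudocode might look like the sketch below. It is only one way to do it: the function name load_csv_in_chunks and its parameters are hypothetical, it assumes comma-separated lines that match a compound dtype, and it uses np.loadtxt for parsing, which may not be the fastest option.

```python
import itertools

import h5py
import numpy as np


def load_csv_in_chunks(lines, h5file, name, dtype, nrows_guess, chunk_rows=100):
    """Copy CSV lines into a 1-D dataset, chunk_rows lines at a time."""
    # Preallocate with a guessed row count; maxshape=(None,) keeps the
    # dataset resizable along its only axis.
    ds = h5file.create_dataset(
        name,
        shape=(nrows_guess,),
        dtype=dtype,
        maxshape=(None,),
        compression="gzip",
        compression_opts=1,
    )
    lines = iter(lines)
    offset = 0
    while True:
        block = list(itertools.islice(lines, chunk_rows))
        if not block:
            break
        # ndmin=1 keeps a single-line block as a 1-D array.
        arr = np.loadtxt(block, dtype=dtype, delimiter=",", ndmin=1)
        end = offset + len(arr)
        if end > ds.shape[0]:
            # Grow only when the guess turns out to be too small.
            ds.resize(end, axis=0)
        ds[offset:end] = arr
        offset = end
    # Trim (or keep) the dataset at the number of rows actually written.
    ds.resize(offset, axis=0)
    return ds
```

Passing an open file object (for example from gzip.open(path, "rt")) as lines should work, since the function only iterates over it.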
>
> Last, (although it won't affect performance) you can replace the pattern:
>
> try:
>     group = f.create_group(grp)
> except ValueError:
>     print("Day group already exists")
>
> with this:
> group = f.require_group(grp)
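
A small sketch of what that buys you: calling require_group twice with the same name returns the existing group the second time instead of raising. The wrapper name day_group is just for illustration.

```python
import h5py


def day_group(h5file, day):
    """Return the group for `day`, creating it only if it is missing."""
    return h5file.require_group(day)
```

Because the second call is harmless, the loader does not need to track which day groups it has already created.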
>
> Since you have your input data in a gzipped csv file, it would also be
> instructive simply to run:
> $ time gunzip -c myfile.csv.gz > /dev/null
> and see how much of your time is spent simply unzipping the input file. :)
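
The same measurement can be made from inside Python, which is closer to what the loader itself pays. This is a rough sketch using only the standard library; time_decompress is a made-up helper name, and the number it reports includes file I/O as well as decompression.

```python
import gzip
import time


def time_decompress(path, block_size=1 << 20):
    """Decompress a gzip file, discarding the output.

    Returns (decompressed_bytes, elapsed_seconds).
    """
    start = time.perf_counter()
    total = 0
    with gzip.open(path, "rb") as fh:
        while True:
            block = fh.read(block_size)
            if not block:
                break
            total += len(block)
    return total, time.perf_counter() - start
```

If the elapsed time here is already close to the total load time, tuning the HDF5 side will not help much.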
>
> Andrew
>
> ----------------------------------------------------------------------
> This mailing list is for HDF software users discussion.
> To subscribe to this list, send a message to hdf-forum-subscribe at hdfgroup.org.
> To unsubscribe, send a message to hdf-forum-unsubscribe at hdfgroup.org.
>
>