[hdf-forum] h5py -- most efficient way to load a hdf5

Peter Alexander vel.accel at gmail.com
Wed Jun 24 08:54:38 EDT 2009


On Wed, Jun 24, 2009 at 7:36 AM, Mag Gam<magawake at gmail.com> wrote:
> Here is some code I have,
>
>
> import numpy as np
> from numpy import *
>
> import gzip
> import h5py
> import re
> import sys, string, time, getopt
> import os
>
> src=sys.argv[1]
> fs = gzip.open(src)
> x=src.split("/")
> filename=x[len(x)-1]
>
> #Get YYYY/MM/DD format
> YYYY=(filename.rsplit(".",2)[0])[0:4]
> MM=(filename.rsplit(".",2)[0])[4:6]
> DD=(filename.rsplit(".",2)[0])[6:8]
>
> f=h5py.File('/tmp/test_foo/FE.hdf5','w')
>
> grp="/"+YYYY
> try:
>  f.create_group(grp)
> except ValueError:
>  print "Year group already exists"
>
> grp=grp+"/"+MM
> try:
>  f.create_group(grp)
> except ValueError:
>  print "Month group already exists"
>
> grp=grp+"/"+DD
> try:
>  group=f.create_group(grp)
> except ValueError:
>  print "Day group already exists"
>
>
> str_type=h5py.new_vlen(str)
> mydescriptor = {'names': ('gender','age','weight'), 'formats': ('S1',
> 'f4', 'f4')}
> print "Filename is: ",src
> fs = gzip.open(src)
>
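> # Start empty; maxshape=(None,) lets the dataset grow along axis 0.
> dset = f.create_dataset('Foo', shape=(0,), maxshape=(None,),
>                         dtype=np.dtype(mydescriptor), compression='gzip')
>
> s=0
> for y in fs:
>  a=y.split(',')
>  s=s+1
>  dset.resize(s,axis=0)
> fs.close()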
>
>
> f.close()
>
>
> This works but just takes a VERY long time.
>
>
> On Tue, Jun 23, 2009 at 4:32 AM, Mag Gam<magawake at gmail.com> wrote:
>> Thanks for the response
>>
>>
>> On Tue, Jun 23, 2009 at 2:46 AM, Andrew
>> Collette<andrew.collette at gmail.com> wrote:
>>> Hi,
>>>
>>>> I have a very large csv file 14G and I am planning to move all of my
>>>> data to hdf5. I am using h5py to load the data. The biggest problem I
>>>> am having is, I am putting the entire file into memory and then
>>>> creating a dataset from it. This is very inefficient and it takes over
>>>> 4 hours to create the hdf5 file.
>>>>
>>>> The csv file has various types:
>>>> int4, int4, str, str, str, str, str
>>>
>>> Since you're using Python you should investigate the functions
>>> numpy.fromfile and numpy.loadtxt.  The biggest thing you should worry
>>> about is finding a way to iterate over rows in the input file.  You
>>> can create an HDF5 dataset with the proper size and dtype, and then
>>> fill it in row by row as you read records in from the csv file.  That
>>> way you avoid having to load the entire file into memory.
>>
>> Correct, this is the way I am trying to do it but do I have to worry
>> about resize? Because each file has different number of rows.
>>
>> numpy.fromfile and numpy.loadtxt actually load everything into memory.
>> That's the way I am doing it now, and it's very inefficient.
>>
>> Do you have any sample code for line by line read and then push it to
>> hdf5 file?
>>
>>
>>
>>>
>>> As far as the datatypes, if all the rows of your CSV have the same
>>> fields, the dtype for the HDF5 file should be something like:
>>>
>>> vl_str = h5py.new_vlen(str)
>>> mydtype = numpy.dtype([('Field1', 'i4'), ('Field2', 'i4'), ('Field3',
>>> vl_str), ('Field4', vl_str), ... <the rest> ...])
>>>
>>> This strategy will create an HDF5 dataset whose elements are a
>>> compound type with two integers and five variable-length strings.
>>>
>>> HTH,
>>> Andrew
>>>
>>> ----------------------------------------------------------------------
>>> This mailing list is for HDF software users discussion.
>>> To subscribe to this list, send a message to hdf-forum-subscribe at hdfgroup.org.
>>> To unsubscribe, send a message to hdf-forum-unsubscribe at hdfgroup.org.
>>>
>>>
>>
>

Your code can certainly be made more efficient. Per-row 'for' loops in
Python are notoriously slow. NumPy has routines to load ASCII/CSV data
into an array, which you can then write to HDF5. Consider reading the
file in chunks rather than the whole file in a single shot.
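
To make the chunking idea concrete, here is a rough sketch, not a
tested drop-in for your data. It reads the gzipped CSV a block of rows
at a time and grows the dataset once per block instead of once per row.
The names iter_chunks, csv_gz_to_hdf5, chunk_rows and the Field1..Field7
column names are made up for illustration; the column layout (two ints
followed by five strings) is taken from the description earlier in the
thread, and h5py.special_dtype(vlen=str) is the spelling newer h5py
releases use for new_vlen(str).

```python
import csv
import gzip
from itertools import islice


def iter_chunks(lines, chunk_rows):
    """Yield lists of up to chunk_rows parsed CSV rows from an iterable of text lines."""
    reader = csv.reader(lines)
    while True:
        chunk = list(islice(reader, chunk_rows))
        if not chunk:
            return
        yield chunk


def csv_gz_to_hdf5(src, dest, name="Foo", chunk_rows=100000):
    """Copy a gzipped CSV (int, int, str x5) into a resizable HDF5 dataset."""
    # Imported here so iter_chunks stays usable without numpy/h5py installed.
    import numpy as np
    import h5py

    vl_str = h5py.special_dtype(vlen=str)
    # Hypothetical field names; the thread only gives the column types.
    dtype = np.dtype([("Field1", "i4"), ("Field2", "i4")]
                     + [("Field%d" % i, vl_str) for i in range(3, 8)])

    with h5py.File(dest, "w") as f, gzip.open(src, "rt", newline="") as fs:
        # maxshape=(None,) makes axis 0 unlimited, so the row count
        # does not have to be known in advance.
        dset = f.create_dataset(name, shape=(0,), maxshape=(None,),
                                dtype=dtype, compression="gzip")
        for chunk in iter_chunks(fs, chunk_rows):
            block = np.array(
                [(int(r[0]), int(r[1])) + tuple(r[2:7]) for r in chunk],
                dtype=dtype)
            start = dset.shape[0]
            dset.resize((start + len(block),))  # one resize per chunk
            dset[start:] = block
```

Only one block of rows is held in memory at a time, so chunk_rows is
the knob that trades memory for fewer resize/write calls.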

You might also be able to use 'h5import', which is one of the command-line
utilities available at the HDF web site.
