[hdf-forum] h5py -- most efficient way to load a hdf5
Andrew Collette
andrew.collette at gmail.com
Tue Jun 23 02:46:09 EDT 2009
Hi,
> I have a very large csv file 14G and I am planning to move all of my
> data to hdf5. I am using h5py to load the data. The biggest problem I
> am having is, I am putting the entire file into memory and then
> creating a dataset from it. This is very inefficient and it takes over
> 4 hours to create the hdf5 file.
>
> The csv file has various types:
> int4, int4, str, str, str, str, str
Since you're using Python you should investigate the functions
numpy.fromfile and numpy.loadtxt. The biggest thing you should worry
about is finding a way to iterate over rows in the input file. You
can create an HDF5 dataset with the proper size and dtype, and then
fill it in row by row as you read records in from the csv file. That
way you avoid having to load the entire file into memory.
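As a rough illustration of the row-by-row idea, the sketch below reads a CSV in fixed-size batches using only the standard library, so at most one batch is held in memory at a time. The helper name iter_batches, the batch size, and the three-row sample are made up for the example; each yielded batch would then be converted and written into the matching slice of the HDF5 dataset.

```python
import csv
import io
from itertools import islice


def iter_batches(csv_file, batch_size):
    """Yield lists of at most batch_size parsed rows from an open CSV file."""
    reader = csv.reader(csv_file)
    while True:
        # islice pulls only the next batch_size rows from the reader
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# Tiny in-memory stand-in for the real file (hypothetical data)
sample = io.StringIO("1,2,a,b,c,d,e\n3,4,f,g,h,i,j\n5,6,k,l,m,n,o\n")
batch_sizes = [len(batch) for batch in iter_batches(sample, batch_size=2)]
```

Writing a few thousand rows per batch rather than one row at a time usually keeps the per-write overhead down while still bounding memory use; the right batch size is something to measure rather than assume.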
As for the datatypes, if all the rows of your CSV have the same
fields, the dtype for the HDF5 file should be something like:

    vl_str = h5py.new_vlen(str)
    mydtype = numpy.dtype([('Field1', 'i4'), ('Field2', 'i4'), ('Field3', vl_str),
                           ('Field4', vl_str), ... <the rest> ...])

This strategy will create an HDF5 dataset whose elements are a
compound type with two integers and five variable-length strings,
matching the seven columns of your CSV.
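A minimal sketch of how such a compound dataset might be created and filled batch by batch follows. Several things here are assumptions rather than part of the advice above: it uses h5py.special_dtype(vlen=str), the spelling documented in later h5py releases, instead of h5py.new_vlen(str); the names Field5 to Field7, the dataset name "table", the chunk size, and the helper append_rows are invented for illustration; and the dataset is made resizable because the row count of a CSV is generally not known before reading it.

```python
import h5py
import numpy as np

# Variable-length string type (later-release spelling of new_vlen(str))
vl_str = h5py.special_dtype(vlen=str)

# Two 32-bit integers followed by five variable-length strings.
# Field names are placeholders, not taken from any real file.
row_dtype = np.dtype(
    [("Field1", "i4"), ("Field2", "i4")]
    + [("Field%d" % i, vl_str) for i in range(3, 8)]
)


def append_rows(dset, rows):
    """Append a batch of (int, int, str, str, str, str, str) tuples to dset."""
    batch = np.array([tuple(r) for r in rows], dtype=row_dtype)
    start = dset.shape[0]
    dset.resize((start + len(batch),))  # grow the dataset by one batch
    dset[start:] = batch                # write only the new slice


# In-memory file (core driver, no backing store) so nothing is left on disk
with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset(
        "table", shape=(0,), maxshape=(None,), dtype=row_dtype, chunks=(1024,)
    )
    append_rows(dset, [(1, 2, "a", "b", "c", "d", "e"),
                       (3, 4, "f", "g", "h", "i", "j")])
    append_rows(dset, [(5, 6, "k", "l", "m", "n", "o")])
    n_rows = dset.shape[0]
    first_col = dset[:]["Field1"].tolist()
```

Rows coming from a CSV reader arrive as strings, so the first two columns would need an explicit int() conversion before being passed to a helper like this. If the number of rows is known in advance, creating the dataset at its final size and assigning slices avoids the repeated resize calls.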
HTH,
Andrew