[hdf-forum] h5py -- most efficient way to load a hdf5
Mag Gam
magawake at gmail.com
Wed Jun 24 07:36:49 EDT 2009
Here is some code I have,
import numpy as np
from numpy import *
import gzip
import h5py
import re
import sys, string, time, getopt
import os
src=sys.argv[1]
fs = gzip.open(src)
x=src.split("/")
filename=x[len(x)-1]
#Get YYYY/MM/DD format
YYYY=(filename.rsplit(".",2)[0])[0:4]
MM=(filename.rsplit(".",2)[0])[4:6]
DD=(filename.rsplit(".",2)[0])[6:8]
f=h5py.File('/tmp/test_foo/FE.hdf5','w')
grp="/"+YYYY
try:
    f.create_group(grp)
except ValueError:
    print "Year group already exists"
grp=grp+"/"+MM
try:
    f.create_group(grp)
except ValueError:
    print "Month group already exists"
grp=grp+"/"+DD
try:
    group=f.create_group(grp)
except ValueError:
    print "Day group already exists"
str_type=h5py.new_vlen(str)
mydescriptor = {'names': ('gender','age','weight'), 'formats': ('S1', 'f4', 'f4')}
print "Filename is: ",src
fs = gzip.open(src)
dset = f.create_dataset ('Foo',data=arr,compression='gzip')
s=0
for y in fs:
    continue
    a=y.split(',')
    s=s+1
    dset.resize(s,axis=0)
fs.close()
f.close()
This works but just takes a VERY long time.
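
A likely contributor to the run time is resizing the dataset once per input line. Below is a minimal sketch of a batched alternative, not a drop-in replacement for the script above. It assumes Python 3 and a current h5py rather than the 2009 API, and it assumes every CSV row has exactly the three columns from mydescriptor. The names load_csv_gz, ROW_DTYPE, group_path and batch_rows are made up for illustration.

```python
import csv
import gzip
import itertools

import h5py
import numpy as np

# Assumed record layout, taken from the descriptor in the post.
ROW_DTYPE = np.dtype([("gender", "S1"), ("age", "f4"), ("weight", "f4")])


def load_csv_gz(src, h5_path, group_path="/", dset_name="Foo", batch_rows=100000):
    """Stream a gzipped CSV into a resizable HDF5 dataset, one batch at a time.

    Returns the number of rows written.
    """
    with gzip.open(src, "rt", newline="") as fs, h5py.File(h5_path, "a") as f:
        # require_group opens the group if it exists and creates it (with any
        # missing parents) otherwise, so no try/except is needed.
        grp = f.require_group(group_path)
        dset = grp.create_dataset(
            dset_name,
            shape=(0,),
            maxshape=(None,),   # unlimited along axis 0, which permits resize()
            dtype=ROW_DTYPE,
            chunks=True,        # let h5py pick a chunk shape
            compression="gzip",
        )
        reader = csv.reader(fs)
        while True:
            # Pull at most batch_rows lines; only this batch is held in memory.
            batch = [row for row in itertools.islice(reader, batch_rows) if row]
            if not batch:
                break
            rows = np.array(
                [(g.encode("ascii"), float(a), float(w)) for g, a, w in batch],
                dtype=ROW_DTYPE,
            )
            start = dset.shape[0]
            stop = start + len(rows)
            dset.resize(stop, axis=0)   # one resize per batch, not per line
            dset[start:stop] = rows
        return dset.shape[0]
```

A call might look like load_csv_gz(sys.argv[1], '/tmp/test_foo/FE.hdf5', group_path='/2009/06/24'). The batch size trades memory for fewer resize and write calls, so the default here is only a starting point to tune.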
On Tue, Jun 23, 2009 at 4:32 AM, Mag Gam<magawake at gmail.com> wrote:
> Thanks for the response
>
>
> On Tue, Jun 23, 2009 at 2:46 AM, Andrew
> Collette<andrew.collette at gmail.com> wrote:
>> Hi,
>>
>>> I have a very large csv file 14G and I am planning to move all of my
>>> data to hdf5. I am using h5py to load the data. The biggest problem I
>>> am having is, I am putting the entire file into memory and then
>>> creating a dataset from it. This is very inefficient and it takes over
>>> 4 hours to create the hdf5 file.
>>>
>>> The csv file has various types:
>>> int4, int4, str, str, str, str, str
>>
>> Since you're using Python you should investigate the functions
>> numpy.fromfile and numpy.loadtxt. The biggest thing you should worry
>> about is finding a way to iterate over rows in the input file. You
>> can create an HDF5 dataset with the proper size and dtype, and then
>> fill it in row by row as you read records in from the csv file. That
>> way you avoid having to load the entire file into memory.
>
> Correct, this is the way I am trying to do it, but do I have to worry
> about resizing? Each file has a different number of rows.
>
> numpy.fromfile and numpy.loadtxt actually load everything into memory.
> That's the way I am doing it now, and it's very inefficient.
>
> Do you have any sample code that reads the file line by line and
> pushes each row to the hdf5 file?
>
>
>
>>
>> As far as the datatypes, if all the rows of your CSV have the same
>> fields, the dtype for the HDF5 file should be something like:
>>
>> vl_str = h5py.new_vlen(str)
>> mydtype = numpy.dtype([('Field1', 'i4'), ('Field2', 'i4'), ('Field3',
>> vl_str), ('Field4', vl_str), ... <the rest> ...])
>>
>> This strategy will create an HDF5 dataset whose elements are a
>> compound type with two integers and five variable-length strings.
>>
>> HTH,
>> Andrew
>>
>> ----------------------------------------------------------------------
>> This mailing list is for HDF software users discussion.
>> To subscribe to this list, send a message to hdf-forum-subscribe at hdfgroup.org.
>> To unsubscribe, send a message to hdf-forum-unsubscribe at hdfgroup.org.
>>
>>
>
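
For reference, the compound dtype suggested in the quoted reply can be sketched as follows with a current h5py. h5py.new_vlen(str) belongs to the 2009-era API; h5py.string_dtype() is the present-day way to get a variable-length string type. The field names are the placeholders from the reply, and write_records and the dataset name "records" are made up for this sketch.

```python
import h5py
import numpy as np

# Variable-length UTF-8 string type (modern counterpart of h5py.new_vlen(str)).
vl_str = h5py.string_dtype(encoding="utf-8")

# Two 4-byte integers followed by five variable-length strings, matching the
# "int4, int4, str, str, str, str, str" layout described in the thread.
mydtype = np.dtype(
    [("Field1", "i4"), ("Field2", "i4")]
    + [("Field%d" % i, vl_str) for i in range(3, 8)]
)


def write_records(h5_path, records, name="records"):
    """Write a list of 7-tuples as a compound-typed HDF5 dataset."""
    arr = np.array(records, dtype=mydtype)
    with h5py.File(h5_path, "w") as f:
        f.create_dataset(name, data=arr)
```

Each record is a plain tuple such as (1, 2, "a", "b", "c", "d", "e"). Depending on the h5py version, the string fields may come back as bytes when the dataset is read.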