[hdf-forum] h5py -- most efficient way to load a hdf5

Mag Gam magawake at gmail.com
Wed Jun 24 07:36:49 EDT 2009


Here is some code I have,


import numpy as np
from numpy import *

import gzip
import h5py
import re
import sys, string, time, getopt
import os

src=sys.argv[1]
fs = gzip.open(src)
x=src.split("/")
filename=x[len(x)-1]

#Get YYYY/MM/DD format
YYYY=(filename.rsplit(".",2)[0])[0:4]
MM=(filename.rsplit(".",2)[0])[4:6]
DD=(filename.rsplit(".",2)[0])[6:8]

f=h5py.File('/tmp/test_foo/FE.hdf5','w')

grp="/"+YYYY
try:
  f.create_group(grp)
except ValueError:
  print "Year group already exists"

grp=grp+"/"+MM
try:
  f.create_group(grp)
except ValueError:
  print "Month group already exists"

grp=grp+"/"+DD
try:
  group=f.create_group(grp)
except ValueError:
  print "Day group already exists"


str_type=h5py.new_vlen(str)
mydescriptor = {'names': ('gender','age','weight'), 'formats': ('S1',
'f4', 'f4')}
print "Filename is: ",src
fs = gzip.open(src)

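# Empty dataset that can grow along axis 0; dtype and maxshape are assumed here.
dset = f.create_dataset('Foo', (0,), dtype=np.dtype(mydescriptor),
                        maxshape=(None,), compression='gzip')

s=0
for y in fs:
  a=y.split(',')
  s=s+1
  # grow the dataset by one row for every line read
  dset.resize(s,axis=0)
fs.close()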


f.close()


This works but just takes a VERY long time.
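
One common way to cut the per-row overhead is to collect rows in a fixed-size NumPy buffer and resize/write the dataset once per batch instead of once per line. The following is only a sketch under assumptions: the three-column layout (two integers and a short fixed-width label), the function name append_csv_in_batches, and the batch size are all hypothetical and would need to be adapted to the real file.

```python
import gzip

import numpy as np
import h5py

# Hypothetical row layout: two 32-bit integers and a fixed-width text field.
ROW_DTYPE = np.dtype([("a", "i4"), ("b", "i4"), ("label", "S16")])


def append_csv_in_batches(csv_gz_path, h5_path, dset_name="Foo", batch_rows=10000):
    """Stream a gzipped CSV into a resizable dataset, one batch at a time.

    Returns the number of rows written. Only one batch is held in memory.
    """
    total = 0
    with gzip.open(csv_gz_path, "rt") as src, h5py.File(h5_path, "w") as f:
        # maxshape=(None,) makes axis 0 unlimited, so resize() is allowed.
        dset = f.create_dataset(dset_name, shape=(0,), maxshape=(None,),
                                dtype=ROW_DTYPE, chunks=(batch_rows,),
                                compression="gzip")
        buf = np.empty(batch_rows, dtype=ROW_DTYPE)
        n = 0  # rows currently held in the buffer
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            a, b, label = line.rstrip("\n").split(",")
            buf[n] = (int(a), int(b), label.encode("ascii"))
            n += 1
            if n == batch_rows:
                # One resize and one write per full batch.
                dset.resize(total + n, axis=0)
                dset[total:total + n] = buf
                total += n
                n = 0
        if n:
            # Flush the final, partially filled batch.
            dset.resize(total + n, axis=0)
            dset[total:total + n] = buf[:n]
            total += n
    return total
```

Called as append_csv_in_batches("input.csv.gz", "/tmp/test_foo/FE.hdf5"), this never needs to know the row count in advance, so files with different numbers of rows are handled the same way. The batch size trades memory for fewer HDF5 calls; the value above is a guess, not a measured optimum.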


On Tue, Jun 23, 2009 at 4:32 AM, Mag Gam<magawake at gmail.com> wrote:
> Thanks for the response
>
>
> On Tue, Jun 23, 2009 at 2:46 AM, Andrew
> Collette<andrew.collette at gmail.com> wrote:
>> Hi,
>>
>>> I have a very large csv file 14G and I am planning to move all of my
>>> data to hdf5. I am using h5py to load the data. The biggest problem I
>>> am having is, I am putting the entire file into memory and then
>>> creating a dataset from it. This is very inefficient and it takes over
>>> 4 hours to create the hdf5 file.
>>>
>>> The csv file has various types:
>>> int4, int4, str, str, str, str, str
>>
>> Since you're using Python you should investigate the functions
>> numpy.fromfile and numpy.loadtxt.  The biggest thing you should worry
>> about is finding a way to iterate over rows in the input file.  You
>> can create an HDF5 dataset with the proper size and dtype, and then
>> fill it in row by row as you read records in from the csv file.  That
>> way you avoid having to load the entire file into memory.
>
> Correct, this is the way I am trying to do it, but do I have to worry
> about resizing? Each file has a different number of rows.
>
> numpy.fromfile and numpy.loadtxt actually load everything into memory.
> That's the way I am doing it now, and it's very inefficient.
>
> Do you have any sample code that reads line by line and then pushes each
> row to the HDF5 file?
>
>
>
>>
>> As far as the datatypes, if all the rows of your CSV have the same
>> fields, the dtype for the HDF5 file should be something like:
>>
>> vl_str = h5py.new_vlen(str)
>> mydtype = numpy.dtype([('Field1', 'i4'), ('Field2', 'i4'), ('Field3',
>> vl_str), ('Field4', vl_str), ... <the rest> ...])
>>
>> This strategy will create an HDF5 dataset whose elements are a
>> compound type with two integers and five variable-length strings.
>>
>> HTH,
>> Andrew
>>
>> ----------------------------------------------------------------------
>> This mailing list is for HDF software users discussion.
>> To subscribe to this list, send a message to hdf-forum-subscribe at hdfgroup.org.
>> To unsubscribe, send a message to hdf-forum-unsubscribe at hdfgroup.org.
>>
>>
>
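
The compound dtype described in the quoted reply can be sketched against a recent h5py release. As far as I know, h5py.new_vlen is no longer available there and h5py.string_dtype() (h5py 2.10 and later) is the replacement. The field names follow the quoted example; the dataset name "records" and the helper write_rows are hypothetical, and only four of the fields are spelled out.

```python
import numpy as np
import h5py

# Variable-length string type; replaces the older h5py.new_vlen(str) spelling.
vl_str = h5py.string_dtype(encoding="utf-8")

# Two integers followed by variable-length strings; the remaining string
# fields would be added in the same way.
row_dtype = np.dtype([("Field1", "i4"), ("Field2", "i4"),
                      ("Field3", vl_str), ("Field4", vl_str)])


def write_rows(h5_path, rows, dset_name="records"):
    """Write a list of (int, int, str, str) tuples as one compound dataset."""
    arr = np.array(rows, dtype=row_dtype)
    with h5py.File(h5_path, "w") as f:
        f.create_dataset(dset_name, data=arr)
    return arr.shape[0]
```

For example, write_rows("/tmp/example.hdf5", [(1, 2, "alpha", "beta")]) stores a single record. Depending on the h5py version, the string fields may come back as bytes rather than str when read.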





