[hdf-forum] Large number of incrementally growing extendible datasets

Björn Andres bjoern.andres at iwr.uni-heidelberg.de
Thu Jun 18 08:40:33 EDT 2009


Hello!

In a computational geometry application, I am dealing with sets of 
points in 3D space. Each point set can be represented as a matrix (e.g. 
a 2D dataset in HDF5) having as many rows as there are points, and three 
columns for the three coordinates of each point.

The exact number of sets (about 10^6) is known when the algorithm is 
initialized, while the number of points in each set is not. New points 
are computed incrementally and have to be appended to the datasets that 
have already been created.

I have written code (C++, HDF5 1.8.3) which creates 10^6 datasets in one 
HDF5 file. These datasets are made extendible in the first dimension 
such that new rows of coordinates can be appended.

On average there are 10 appends per dataset, and each append consists on 
average of 250 points (3 kBytes). After about 7 GB have been written to 
the hard drive, write performance drops to almost zero. Note that at 
most one dataset is open at any time.
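
To put these figures in perspective, here is a rough back-of-the-envelope sketch of the data volume. It assumes three 4-byte (single-precision) coordinates per point, which is only inferred from "250 points (3 kBytes)", and it ignores any HDF5 metadata overhead. The function names are made up for illustration.

```cpp
#include <cstdint>

// Assumption: each point is three 4-byte floats (inferred from
// "250 points (3 kBytes)"; not stated explicitly in the post).
constexpr std::uint64_t kBytesPerPoint = 3 * 4;

// Raw payload of a single append, in bytes.
constexpr std::uint64_t bytesPerAppend(std::uint64_t pointsPerAppend) {
    return pointsPerAppend * kBytesPerPoint;
}

// Raw payload of the whole run, in bytes (metadata overhead ignored).
constexpr std::uint64_t totalBytes(std::uint64_t datasets,
                                   std::uint64_t appendsPerDataset,
                                   std::uint64_t pointsPerAppend) {
    return datasets * appendsPerDataset * bytesPerAppend(pointsPerAppend);
}

// Number of whole appends that fit into a given number of written bytes.
constexpr std::uint64_t appendsCompleted(std::uint64_t bytesWritten,
                                         std::uint64_t pointsPerAppend) {
    return bytesWritten / bytesPerAppend(pointsPerAppend);
}

// One append of 250 points is 3000 bytes, matching the 3 kBytes above.
static_assert(bytesPerAppend(250) == 3000, "3 kBytes per append");
```

Under these assumptions, totalBytes(1000000, 10, 250) gives 3 * 10^10 bytes (30 GB) for the full run, and appendsCompleted(7000000000, 250) gives roughly 2.3 million appends. The slowdown would therefore set in after about 2.3 of the expected 10 appends per dataset, if the 7 GB were pure payload.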

I am now wondering whether the introduction of 10^6 extendible datasets 
is a bad idea overall.

- Does HDF5 internally move data around so that extendible datasets 
remain contiguous after new data has been appended?

- Does HDF5 require the file to be contiguous on the hard drive? Could 
the file system be causing the problem?

- Can caching be a problem? Note that at most one dataset is open at any 
time. I close it right after having appended data.

I would appreciate any hints!

Kind regards,
Bjoern
