[hdf-forum] Advice on way to write HDF file for optimal reading.

nfortne2 at hdfgroup.org nfortne2 at hdfgroup.org
Wed Oct 29 18:10:14 EDT 2008


Andrew,

The best way to do this depends on your priorities for performance.
The most important parameters influencing read and write performance
here are the chunk dimensions and the raw data chunk cache (rdcc)
configuration.  Are you using any compression on the dataset?  The
following advice assumes you are not:

To optimize read performance, because the rows are contiguous in
memory, you will probably want the cache to be smaller than the size
of a chunk (see H5Pset_cache).  This will force the library to read
only the data you want directly from the disk.  You also want the
chunks to be wide, to minimize the number of I/O operations required.
If the rows you read one after another tend to be close to each other,
it may be a good idea to increase the size of the cache so that it can
fit an entire "row" (or more) of chunks.
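
To put rough numbers on the two cache thresholds mentioned above, here is
a minimal sketch in plain C (it makes no HDF5 calls).  The helper names
are hypothetical, and the 64 x 500 chunk shape used in the note below is
only an assumed example; the 3000 columns of single precision data come
from Andrew's message.

```c
#include <stddef.h>

/* Hypothetical helpers for illustration; none of these are HDF5 functions. */

/* Number of chunks needed to span n elements when each chunk covers c
 * elements along that dimension (ceiling division). */
size_t chunks_spanning(size_t n, size_t c)
{
    return (n + c - 1) / c;
}

/* Size in bytes of one chunk of chunk_rows x chunk_cols elements. */
size_t chunk_bytes(size_t chunk_rows, size_t chunk_cols, size_t elem_size)
{
    return chunk_rows * chunk_cols * elem_size;
}

/* Bytes needed to hold every chunk that one dataset row passes through,
 * i.e. one full "row" of chunks. */
size_t row_of_chunks_bytes(size_t ncols, size_t chunk_rows,
                           size_t chunk_cols, size_t elem_size)
{
    return chunks_spanning(ncols, chunk_cols)
         * chunk_bytes(chunk_rows, chunk_cols, elem_size);
}
```

With 3000 columns of 4-byte floats and the assumed 64 x 500 chunks, one
chunk is 128000 bytes and a row of chunks is 6 x 128000 = 768000 bytes.
An rdcc size below the first figure keeps chunks out of the cache; an
rdcc size at or above the second lets a whole row of chunks stay cached.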

To optimize write performance, you may want to change the cache size  
to fit a chunk, depending on the width of the chunk you select.  This  
will allow the library to write the entire chunk at once (per column),  
rather than seeking to each individual element.  Narrower chunks will  
be faster in this case.

To change the cache configuration between writing and reading you will
need to close and then reopen the file, because H5Pset_cache is applied
through the file access property list used to open it.

The optimum chunk height depends on how much memory you want the  
application to use.
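
One way to turn that into a number, assuming the cache is sized to hold
a full "row" of chunks for reading: the cached bytes are then roughly
chunk height x padded row width x element size, so a memory budget caps
the chunk height.  The function below is a plain C sketch with a
hypothetical name, and the 48 MB budget in the note is only an example.

```c
#include <stddef.h>

/* Hypothetical helper, not an HDF5 call: largest chunk height (in rows)
 * for which one full row of chunks fits in budget_bytes.  Chunks at the
 * right edge are held at full width, so the column count is rounded up
 * to a whole number of chunks.  Returns 0 for degenerate input. */
size_t max_chunk_rows(size_t budget_bytes, size_t ncols,
                      size_t chunk_cols, size_t elem_size)
{
    if (chunk_cols == 0 || elem_size == 0 || ncols == 0)
        return 0;

    size_t chunks_per_row  = (ncols + chunk_cols - 1) / chunk_cols;
    size_t padded_row_bytes = chunks_per_row * chunk_cols * elem_size;

    return budget_bytes / padded_row_bytes;
}
```

For 3000 columns of 4-byte floats with 500-column chunks, a budget of
48000000 bytes allows chunks up to 4000 rows high; with 512-column
chunks the padding lowers that to 3906 rows.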

Good luck and let me know if you have any further questions.

Neil Fortner
The HDF Group

Quoting Andrew Cunningham <andrewc at mac.com>:

> Hi,
> 	 I am looking for advice on the optimal way to R/W an HDF file in C/C++
> under the following circumstances.
> 	Essentially my data is one large array of single precision data of,
> say, 1 million rows by, say, 3000 columns. Sometimes it can be
> smaller, of course, and may be larger, but that would be a typical
> 'large' problem.
>
> 	The data is "received" by columns 1, 2, 3, ..., N. Obviously I cannot
> hold all the data in memory so I write a column 'hyperslab' one at a
> time as I get the data.
>
> 	The main requirement is I need to efficiently be able to read a random
> selection of rows. So I need to read, say, the data from 1000 rows
> (random indices) into memory. I do not know beforehand which rows will
> need to be read. Obviously the implementation is to construct a
> hyperslab read of those rows - but is that the most efficient way?
>
> 	Any tips before I dive in?
>
> Andrew
>
>
> ----------------------------------------------------------------------
> This mailing list is for HDF software users discussion.
> To subscribe to this list, send a message to   
> hdf-forum-subscribe at hdfgroup.org.
> To unsubscribe, send a message to hdf-forum-unsubscribe at hdfgroup.org.


