[Hdf-forum] Chunk cache size and performance

Francesc Alted faltet at pytables.org
Fri Jan 8 12:13:25 EST 2010


Hi Rob,

On Friday 08 January 2010 16:10:01 Rob Latham wrote:
> On Thu, Jan 07, 2010 at 08:30:57PM +0100, Francesc Alted wrote:
> > What I want to stress during the workshop is the dependency of I/O
> > throughput on the chunksize for a certain dataset.  For making the plots
> > that I've got (attached), I have chosen a dataset of 2 GB (2-dim, shape
> > is (512, 65536) and datatype is double precision) so that it can easily
> > fit into my OS cache memory (my machine has 8 GB) and make the effects
> > clearer.  In the X axis, I represent the chunksize for every dataset
> > (from 1 KB up to 8 MB).  In the Y axis there is the performance for
> > reading the dataset sequentially.
> 
> I'd appreciate a bit more explanation of your methodology.  You want
> to test *I/O throughput* but at the same time you want to make sure
> the data fits in memory cache.  Are you not then just testing memory
> bandwidth?

Nope.  I'd like to characterize a situation where I can maximize the 
performance of both sequential and 'semi-random' access to a file.  By 
'semi-random' I mean an access mode that performs random accesses within a 
certain row and then repeats the operation with other rows.  As the shape of 
my 2 GB dataset is (512, 524288) (the shape stated in my previous message was 
wrong, sorry), a single row takes 524288 * 8 bytes = 4 MB, so I need a chunk 
cache of at least 4 MB to minimize the access time in such a 'semi-random' 
mode.  I'm attaching a couple of plots that show how the 8 MB cache works 
much better than the default cache size of 1 MB.
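
To make the arithmetic above explicit, here is a minimal sketch of the size 
calculation.  It is not taken from the attached script; the helper name 
row_fits_in_cache is hypothetical, and it assumes chunks that span a single 
row, so that caching one row costs exactly the row size.

```python
# Shape as corrected in this message; 8 bytes per double-precision element.
NROWS, NCOLS = 512, 524288
ITEMSIZE = 8

MB = 1024 ** 2

dataset_bytes = NROWS * NCOLS * ITEMSIZE   # whole dataset
row_bytes = NCOLS * ITEMSIZE               # one full row


def row_fits_in_cache(cache_bytes):
    """Return True if one full row can stay in the chunk cache at once.

    Assumption (hypothetical): chunks span a single row, so a row needs
    exactly row_bytes of cache regardless of the chunksize.
    """
    return row_bytes <= cache_bytes


print(dataset_bytes // MB, "MB in the dataset")
print(row_bytes // MB, "MB per row")
for cache_mb in (1, 8):
    print(cache_mb, "MB cache holds a full row:",
          row_fits_in_cache(cache_mb * MB))
```

Under these assumptions the default 1 MB cache cannot hold a whole row, while 
the 8 MB cache can.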

Unfortunately, choosing 8 MB does have a significant impact on the sequential 
access mode, as explained in the original post.  The thing is that I don't 
completely understand why this is so (i.e. I don't understand well how the 
HDF5 chunk cache works ;-).

I'm attaching the script that I'm using for this case, if that helps to 
clarify things.  It is written in Python/PyTables, but I think it is simple 
enough to be understood, at least at a high level.

> If I were running this benchmark I would be purging the memory cache
> between every run: the chunk cache is designed to improve disk
> performance, right?

That's an interesting idea.  How can the HDF5 chunk cache be purged?  However, 
my latest profiles don't suggest this could help.  I've sent these profiles to 
this list, but as they are screenshots of the kcachegrind graphical tool, the 
total size of the message exceeds what is allowed in this forum.  I'll try to 
reduce the message size and send it again.
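
As for purging between runs, the following is only a sketch, and it concerns 
the OS page cache rather than the HDF5 chunk cache asked about above.  It 
assumes a Unix-like system and a Python version that provides 
os.posix_fadvise; the function name drop_file_from_page_cache is 
hypothetical.  The call is advisory, so the kernel may ignore it, and pages 
that have not yet been written to disk are not dropped.

```python
import os


def drop_file_from_page_cache(path):
    """Ask the kernel to evict the cached pages of `path`.

    Returns True if the advice was issued, False if os.posix_fadvise is
    not available on this platform.  This does not touch the HDF5 chunk
    cache, which lives inside the HDF5 library.
    """
    if not hasattr(os, "posix_fadvise"):
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0 and length=0 mean "the whole file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return True
```

It would be called on the data file before each timed read, so that every 
run starts without that file's pages in the OS cache.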

Thanks,

-- 
Francesc Alted
-------------- next part --------------
A non-text attachment was scrubbed...
Name: random-1MB.pdf
Type: application/pdf
Size: 17323 bytes
Desc: not available
URL: <http://mail.hdfgroup.org/pipermail/hdf-forum_hdfgroup.org/attachments/20100108/4a54eeb8/attachment-0002.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: random-8MB.pdf
Type: application/pdf
Size: 17291 bytes
Desc: not available
URL: <http://mail.hdfgroup.org/pipermail/hdf-forum_hdfgroup.org/attachments/20100108/4a54eeb8/attachment-0003.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: optimal-chunksize.py
Type: text/x-python
Size: 3528 bytes
Desc: not available
URL: <http://mail.hdfgroup.org/pipermail/hdf-forum_hdfgroup.org/attachments/20100108/4a54eeb8/attachment-0001.py>

