[Hdf-forum] Chunk cache size and performance

Quincey Koziol koziol at hdfgroup.org
Tue Jan 19 16:50:26 EST 2010


Hi Ger,

On Jan 14, 2010, at 1:59 AM, Ger van Diepen wrote:

> Hi Neil,
> 
> Thanks for your answer.
> I'm afraid I do not fully understand what you mean with a vector of
> chunks and a plane of chunks.
> Is a vector of chunks the collection of chunks you need to access an
> entire vector in some direction? E.g. In a 3-dim case to get i,*,k (a
> vector in y)? To process all such vectors (e.g. to average them) you
> need to iterate over i and k.
> And is a plane of chunks all chunks you need to get a plane, say,
> *,*,k?

	Yes, I believe that's what Neil meant.

> I imagine the cache to be a collection of chunks, say, a
> std::vector<chunk> in C++ terms. Ideally the cache should be so large
> that all chunks you need to get a hyperslab fit in it. It means that
> when getting the next hyperslab all data needed are usually still in the
> cache. So at the end, after iterating through the entire cube, all
> chunks are read only once. So a chunk should never be read back as you
> describe in your pictures.

	Yes, that's the ideal situation, but you've got to set the cache size large enough to make this occur.

> My idea would be that when reading, say, 0,*,0 HDF5 knows it needs,
> say, chunk 0, 4, 8, and 12. So it can first test if they are in the
> cache (using the chunk cache hash) before searching the chunk Btree. If
> the cache is large enough, these tests are almost always true. So a
> Btree lookup needs to be done seldomly (in fact only when a chunk needs
> to be read). I got the impression that HDF5 is doing a Btree lookup for
> each hyperslab access instead of using the approach I sketched. I assume
> that a Btree lookup is much more expensive than a hash lookup.
> But maybe my idea of what the chunk cache looks like and how HDF5 uses
> it is totally wrong.  Quincey or you can correct me.

	The B-tree lookup is currently what maps the dataspace coordinates to the address of a chunk, and then the chunk in the cache is looked up by it's address.  That being said, I see no reason why the chunks in the cache couldn't have the dataspace coordinates attached to them and have the cache searched before the B-tree was queried.  I'll file a bug for this in our issue tracker.

> In my test program the cache is sized that way. The test program uses
> 3-dim arrays and can iterate over vectors in any direction (say get
> i,*,k for all i and k) or iterate over chunks. I'll attach the test
> program, so you can see it. It knows the access pattern, so it sizes the
> cache beforehand.
> Note that in the vector case the naive iteration is over all i and then
> over all k which requires a large cache. You can do better than that by
> first iterating over the chunks in i and k and then inside these chunks
> iterate over i and k. It requires a much smaller cache.
> 
> We also use the casacore package containing an RDBMS-like table system
> that stores arrays in a chunked way and uses a cache when accessing
> them. It performs way better when iterating over vectors in the way
> described above. So it should be possible to improve HDF5 considerably.
> BTW. casacore is moving towards using mmap when possible, so the OS
> takes care of caching.

	Yes, I think if we improve our cache lookup strategy, we should be able to improve our I/O performance also.  Thanks for the idea!

	Quincey

> Cheers,
> Ger
> 
>>>> On 1/13/2010 at  5:12 PM, in message
> <4B4DF0D7.9020209 at hdfgroup.org>, Neil
> Fortner <nfortne2 at hdfgroup.org> wrote: 
>> Hi all,
>> 
>> I've been following this for a while, and while I still don't have a
> 
>> complete picture of what the benchmark is doing, I have an idea that
> 
>> might shed some light on this.
>> 
>> My understanding is that the chunk cache is set large enough to hold
> one 
>> vector of chunks but not one plane of chunks, correct?  If the
> vectors 
>> are iterated over in the obvious way (i.e. one plane at a time) then
> 
>> each vector of chunks in the chunk cache will have to be thrown away
> 
>> once for every plane that passes through the chunks, obviously
> hurting 
>> performance.  In this case, it would be better to disable the chunk 
>> cache entirely (set nbytes or nslots to 0), hence why Francesc sees 
>> better performance with the smaller chunk cache.  Ideally, you would
> 
>> either use a chunk cache large enough for a plane of chunks, or
> iterate 
>> over the entire vector of chunks before moving on to the next vector
> of 
>> chunks in the plane.
>> 
>> Does this make sense?
>> 
>> Here are some (2-D) diagrams to help illustrate what I'm talking
> about:
>> 
>> In cache
>>    |
>>   V
>> +----+----+----+
>> |X   |    |    |
>> |    |    |    |
>> +----+----+----+
>> |    |    |    |
>> |    |    |    |
>> +----+----+----+
>> 
>> +----+----+----+
>> |.X  |    |    |
>> |    |    |    |
>> +----+----+----+
>> |
>   |    |    |
>> |    |    |    |
>> +----+----+----+
>> 
>> +----+----+----+
>> |..X |    |    |
>> |    |    |    |
>> +----+----+----+
>> |    |    |    |
>> |    |    |    |
>> +----+----+----+
>> 
>> +----+----+----+
>> |...X|    |    |
>> |    |    |    |
>> +----+----+----+
>> |    |    |    |
>> |    |    |    |
>> +----+----+----+
>> 
>> Evicted
>>   |   In cache
>>  V       V
>> +----+----+----+
>> |....|X   |    |
>> |    |    |    |
>> +----+----+----+
>> |    |    |    |
>> |    |    |    |
>> +----+----+----+
>> 
>> +----+----+----+
>> |....|.X  |    |
>> |    |    |    |
>> +----+----+----+
>> |    |    |    |
>> |    |    |    |
>> +----+----+----+
>> 
>> ...
>>              In cache
>>                  V
>> +----+----+----+
>> |....|....|...X|
>> |    |    |    |
>> +----+----+----+
>> |    |    |    |
>> |    |    |    |
>> +----+----+----+
>> 
>> Read back into cache from disk for the second time
>>    |         Evicted
>>    V            V
>> +----+----+----+
>> |....|....|....|
>> |X   |    |    |
>> +----+----+----+
>> |    |    |    |
>> |    |    |    |
>> +----+----+----+
>> 
>> ...
>> 
>> -Neil
>> 
>> 
>> 
>> On 01/12/2010 09:10 AM, Ger van Diepen wrote:
>>> Hi Quincey,
>>> 
>>> It uses fixed dimensions.
>>> It iterates over vectors in the x, y or z direction. In the test the
> z
>>> axis was quite short, so it results in many small hyperslab
> accesses.
>>> Of course, I could access a plane at a time and get vectors myself,
> but
>>> I would think that is just what the cache should do for me. Besides,
> I
>>> do not always know if a plane fits in memory.
>>> 
>>> I can imagine that doing so many btree lookups is quite expensive.
> Do
>>> you think that is the main performance bottleneck?
>>> But why so many btree lookups?
>>> Because the chunk and array size are fixed, it can derive the chunks
> to
>>> use from the hyperslab coordinates. Then it can first look if they
> are
>>> in the cache which is (I assume) much faster than looking in the
> chunk
>>> Btree.
>>> Does the experimental branch improve in this way?
>>> 
>>> 
>>> Not that Francesc's original question is still open. Why does
>>> performance degrade when using a larger chunk cache?
>>> 
>>> Cheers,
>>> Ger
>>> 
>>> 
>>>>>> On 1/12/2010 at  3:27 PM, in message
>>>>>> 
>>> <F2ABECC0-20F4-4A1E-B178-BDE7BAEE7982 at hdfgroup.org>, Quincey Koziol
>>> <koziol at hdfgroup.org>  wrote:
>>> 
>>>> Hi Francesc,
>>>> 
>>>> On Jan 8, 2010, at 11:45 AM, Francesc Alted wrote:
>>>> 
>>>> 
>>>>> Hi Ger,
>>>>> 
>>>>> A Friday 08 January 2010 08:25:09 escriguéreu:
>>>>> 
>>>>>> Hi Francesc,
>>>>>> 
>>>>>> This might be related to a problem I reported last June.
>>>>>> I did tests using a 3-dim array with various chunk shapes and
>>>>>> 
>>> access
>>> 
>>>>>> patterns. It got very slow when iterating through the data by
>>>>>> 
>>> vector in
>>> 
>>>>>> the Z-direction. I believe it was filed as a bug by the HDF5
> group.
>>>>>> 
>>> I
>>> 
>>>>>> sent a test program to Quincey that shows the behaviour. I'll
>>>>>> 
>>> forward
>>> 
>>>>>> that mail and the test program to you, so you can try it out
>>>>>> 
>>> yourself if
>>> 
>>>>>> you like to.
>>>>>> 
>>>>>> I suspect the cache lookup algorithm to be the culprit. The
> larger
>>>>>> 
>>> the
>>> 
>>>>>> cache and the more often it has to look up, the slower things
> get.
>>>>>> 
>>> BTW,
>>> 
>>>>>> Did you adapt the cache's hash size to the number of slots in
> the
>>>>>> 
>>> cache?
>>> 
>>>>> Thanks for your suggestion.  I've been looking at your problem,
> and
>>>>> 
>>> my
>>> 
>>>>> profiles seem to say that it is not a cache issue.
>>>>> 
>>>>> Have a look at the attached screenshots showing profiles for your
>>>>> 
>>> test bed
>>> 
>>>>> reading in the x axis with a cache size of 4 KB (the HDF5 cache
>>>>> 
>>> subsystem
>>> 
>>>> does
>>>> 
>>>>> not enters in action at all) and 256 KB (your size).  I've also
>>>>> 
>>> added a
>>> 
>>>>> profile
> for the tiled case for comparison purposes.  For all the
>>>>> 
>>> profiles
>>> 
>>>>> (except tiled), the bottleneck is clearly in the `H5FL_reg_free`
> and
>>>>> 
>>> 
>>>>> `H5FL_reg_malloc` calls, no matter how large the cache size is
> (even
>>>>> 
>>> if it
>>> 
>>>>> does not enters in action).
>>>>> 
>>>>> I think this is expected, because HDF5 has to reserve space for
> each
>>>>> 
>>> I/O
>>> 
>>>>> operation.  When you walk the dataset following directions x or
> y,
>>>>> 
>>> you have
>>> 
>>>> to
>>>> 
>>>>> do (32*2)x more I/O operations than for the tiled case, and HDF5
>>>>> 
>>> needs to
>>> 
>>>> book
>>>> 
>>>>> (and free again!) (32*2)x more memory areas.  Also, when you read
>>>>> 
>>> through
>>> 
>>>> the
>>>> 
>>>>> z axis, the additional times to book/release memory is (32*32)x.
>>>>> 
>>> All of
>>> 
>>>> this
>>>> 
>>>>> is consistent with both profiles and running the benchmark
>>>>> 
>>> manually:
>>> 
>>>>> faltet at antec:/tmp>  time ./tHDF5 1024 1024 10 32 32 2 t
>>>>> real    0m0.057s
>>>>> user    0m0.048s
>>>>> sys     0m0.004s
>>>>> 
>>>>> faltet at antec:/tmp>  time ./tHDF5 1024 1024 10 32 32 2 x
>>>>> setting cache to 32 chunks (4096 bytes) with 3203 slots  //
> forcing
>>>>> 
>>> no cache
>>> 
>>>>> real    0m1.055s
>>>>> user    0m0.860s
>>>>> sys     0m0.168s
>>>>> 
>>>>> faltet at antec:/tmp>  time ./tHDF5 1024 1024 10 32 32 2 y
>>>>> setting cache to 32 chunks (262144 bytes) with 3203 slots
>>>>> real    0m1.211s
>>>>> user    0m1.176s
>>>>> sys     0m0.028s
>>>>> 
>>>>> faltet at antec:/tmp>  time ./tHDF5 1024 1024 10 32 32 2 z
>>>>> setting cache to 5 chunks (40960 bytes) with 503 slots
>>>>> real    0m14.813s
>>>>> user    0m14.777s
>>>>> sys     0m0.024s
>>>>> 
>>>>> So, in my opinion, there is little that HDF5 can do here.  You
>>>>> 
>>> should better
>>> 
>>>> 
>>>>> adapt the chunk shape to your most used case (if you have just
> one,
>>>>> 
>>> but I
>>> 
>>>> know
>>>> 
>>>>> 
>>>  that this is not typically the case).
>>> 
>>>> 	Interesting results.  I'll try to comment on a number of
>>>> 
>>> fronts:
>>> 
>>>> 		- The H5FL_* routines are an internal "free list" that
>>>> 
>>> the HDF5 library uses,
>>> 
>>>> in order to avoid calling malloc() as frequently.  You can disable
>>>> 
>>> them for
>>> 
>>>> your benchmarking by configuring the library with the
>>>> "--enable-using-memchecker".
>>>> 
>>>> 		- It looks like you are using the hyperslab routines
>>>> 
>>> quite a bit, are you
>>> 
>>>> creating complicated hyperslabs?
>>>> 
>>>> 		- The B-tree comparison routine is being used to locate
>>>> 
>>> the correct chunk to
>>> 
>>>> perform I/O on, which may or may not be in the cache.  Do your
>>>> 
>>> datasets have
>>> 
>>>> fixed or unlimited dimensions?  If they are fixed (or only have 1
>>>> 
>>> unlimited
>>> 
>>>> dimension), I can point you at an experimental branch with the new
>>>> 
>>> chunk
>>> 
>>>> indexing features and we can see if that would help your
>>>> 
>>> performance.
>>> 
>>>> 	Quincey
>>>> 
>>>> 
>>>>>> In your tests you only mention the chunk size, but not the chunk
>>>>>> 
>>> shape.
>>> 
>>>>>> Isn't that important? It gives me the impression that in your
> tests
>>>>>> 
>>> the
>>> 
>>>>>> data are stored and accessed fully sequentially which makes the
>>>>>> 
>>> cache
>>> 
>>>>>> useless.
>>>>>> 
>>>>> Yes, chunk shape is important, sorry, I forgot this important
>>>>> 
>>> detail.  As I
>>> 
>>>>> mentioned in a previous message to Rob Latham, I want to optimize
>>>>> 
>>> 'semi-
>>> 
>>>>> random' access mode in a certain row of the dataset, so I
> normally
>>>>> 
>>> choose
>>> 
>>>> the
>>>> 
>>>>> chunk shape as (1, X), where X is the needed value for obtaining
> a
>>>>> 
>>> chunksize
>>> 
>>>> 
>>>>> between 1 KB and 8 MB --if
> X is larger than the maximum number of
>>>>> 
>>> columns, I
>>> 
>>>>> expand the number of rows in the chunk shape accordingly.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> -- 
>>>>> Francesc Alted
>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> <tHDF5-tile.png><tHDF5-x-4KB.png><tHDF5-x-256KB.png>_______________________________
>> _____
>>> 
>>>> ___________
>>>> 
>>>>> Hdf-forum is for HDF software users discussion.
>>>>> Hdf-forum at hdfgroup.org
>>>>> http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
>>>>> 
>>>> 
>>>> _______________________________________________
>>>> Hdf-forum is for HDF software users discussion.
>>>> Hdf-forum at hdfgroup.org
>>>> http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
>>>> 
>>> 
>>> _______________________________________________
>>> Hdf-forum is for HDF software users discussion.
>>> Hdf-forum at hdfgroup.org
>>> http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
>>> 
>> 
>> 
>> _______________________________________________
>> Hdf-forum is for HDF software users discussion.
>> Hdf-forum at hdfgroup.org
>> http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
> 
> <tHDF5.cc>_______________________________________________
> Hdf-forum is for HDF software users discussion.
> Hdf-forum at hdfgroup.org
> http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org




More information about the Hdf-forum mailing list