[Hdf-forum] Chunk cache size and performance
Quincey Koziol
koziol at hdfgroup.org
Tue Jan 19 16:50:26 EST 2010
Hi Ger,
On Jan 14, 2010, at 1:59 AM, Ger van Diepen wrote:
> Hi Neil,
>
> Thanks for your answer.
> I'm afraid I do not fully understand what you mean with a vector of
> chunks and a plane of chunks.
> Is a vector of chunks the collection of chunks you need to access an
> entire vector in some direction? E.g. In a 3-dim case to get i,*,k (a
> vector in y)? To process all such vectors (e.g. to average them) you
> need to iterate over i and k.
> And is a plane of chunks all chunks you need to get a plane, say,
> *,*,k?
Yes, I believe that's what Neil meant.
> I imagine the cache to be a collection of chunks, say, a
> std::vector<chunk> in C++ terms. Ideally the cache should be so large
> that all chunks you need to get a hyperslab fit in it. It means that
> when getting the next hyperslab all data needed are usually still in the
> cache. So at the end, after iterating through the entire cube, all
> chunks are read only once. So a chunk should never be read back as you
> describe in your pictures.
Yes, that's the ideal situation, but you've got to set the cache size large enough to make this occur.
> My idea would be that when reading, say, 0,*,0 HDF5 knows it needs,
> say, chunk 0, 4, 8, and 12. So it can first test if they are in the
> cache (using the chunk cache hash) before searching the chunk Btree. If
> the cache is large enough, these tests are almost always true. So a
> Btree lookup needs to be done seldomly (in fact only when a chunk needs
> to be read). I got the impression that HDF5 is doing a Btree lookup for
> each hyperslab access instead of using the approach I sketched. I assume
> that a Btree lookup is much more expensive than a hash lookup.
> But maybe my idea of what the chunk cache looks like and how HDF5 uses
> it is totally wrong. Quincey or you can correct me.
The B-tree lookup is currently what maps the dataspace coordinates to the address of a chunk, and then the chunk in the cache is looked up by it's address. That being said, I see no reason why the chunks in the cache couldn't have the dataspace coordinates attached to them and have the cache searched before the B-tree was queried. I'll file a bug for this in our issue tracker.
> In my test program the cache is sized that way. The test program uses
> 3-dim arrays and can iterate over vectors in any direction (say get
> i,*,k for all i and k) or iterate over chunks. I'll attach the test
> program, so you can see it. It knows the access pattern, so it sizes the
> cache beforehand.
> Note that in the vector case the naive iteration is over all i and then
> over all k which requires a large cache. You can do better than that by
> first iterating over the chunks in i and k and then inside these chunks
> iterate over i and k. It requires a much smaller cache.
>
> We also use the casacore package containing an RDBMS-like table system
> that stores arrays in a chunked way and uses a cache when accessing
> them. It performs way better when iterating over vectors in the way
> described above. So it should be possible to improve HDF5 considerably.
> BTW. casacore is moving towards using mmap when possible, so the OS
> takes care of caching.
Yes, I think if we improve our cache lookup strategy, we should be able to improve our I/O performance also. Thanks for the idea!
Quincey
> Cheers,
> Ger
>
>>>> On 1/13/2010 at 5:12 PM, in message
> <4B4DF0D7.9020209 at hdfgroup.org>, Neil
> Fortner <nfortne2 at hdfgroup.org> wrote:
>> Hi all,
>>
>> I've been following this for a while, and while I still don't have a
>
>> complete picture of what the benchmark is doing, I have an idea that
>
>> might shed some light on this.
>>
>> My understanding is that the chunk cache is set large enough to hold
> one
>> vector of chunks but not one plane of chunks, correct? If the
> vectors
>> are iterated over in the obvious way (i.e. one plane at a time) then
>
>> each vector of chunks in the chunk cache will have to be thrown away
>
>> once for every plane that passes through the chunks, obviously
> hurting
>> performance. In this case, it would be better to disable the chunk
>> cache entirely (set nbytes or nslots to 0), hence why Francesc sees
>> better performance with the smaller chunk cache. Ideally, you would
>
>> either use a chunk cache large enough for a plane of chunks, or
> iterate
>> over the entire vector of chunks before moving on to the next vector
> of
>> chunks in the plane.
>>
>> Does this make sense?
>>
>> Here are some (2-D) diagrams to help illustrate what I'm talking
> about:
>>
>> In cache
>> |
>> V
>> +----+----+----+
>> |X | | |
>> | | | |
>> +----+----+----+
>> | | | |
>> | | | |
>> +----+----+----+
>>
>> +----+----+----+
>> |.X | | |
>> | | | |
>> +----+----+----+
>> |
> | | |
>> | | | |
>> +----+----+----+
>>
>> +----+----+----+
>> |..X | | |
>> | | | |
>> +----+----+----+
>> | | | |
>> | | | |
>> +----+----+----+
>>
>> +----+----+----+
>> |...X| | |
>> | | | |
>> +----+----+----+
>> | | | |
>> | | | |
>> +----+----+----+
>>
>> Evicted
>> | In cache
>> V V
>> +----+----+----+
>> |....|X | |
>> | | | |
>> +----+----+----+
>> | | | |
>> | | | |
>> +----+----+----+
>>
>> +----+----+----+
>> |....|.X | |
>> | | | |
>> +----+----+----+
>> | | | |
>> | | | |
>> +----+----+----+
>>
>> ...
>> In cache
>> V
>> +----+----+----+
>> |....|....|...X|
>> | | | |
>> +----+----+----+
>> | | | |
>> | | | |
>> +----+----+----+
>>
>> Read back into cache from disk for the second time
>> | Evicted
>> V V
>> +----+----+----+
>> |....|....|....|
>> |X | | |
>> +----+----+----+
>> | | | |
>> | | | |
>> +----+----+----+
>>
>> ...
>>
>> -Neil
>>
>>
>>
>> On 01/12/2010 09:10 AM, Ger van Diepen wrote:
>>> Hi Quincey,
>>>
>>> It uses fixed dimensions.
>>> It iterates over vectors in the x, y or z direction. In the test the
> z
>>> axis was quite short, so it results in many small hyperslab
> accesses.
>>> Of course, I could access a plane at a time and get vectors myself,
> but
>>> I would think that is just what the cache should do for me. Besides,
> I
>>> do not always know if a plane fits in memory.
>>>
>>> I can imagine that doing so many btree lookups is quite expensive.
> Do
>>> you think that is the main performance bottleneck?
>>> But why so many btree lookups?
>>> Because the chunk and array size are fixed, it can derive the chunks
> to
>>> use from the hyperslab coordinates. Then it can first look if they
> are
>>> in the cache which is (I assume) much faster than looking in the
> chunk
>>> Btree.
>>> Does the experimental branch improve in this way?
>>>
>>>
>>> Not that Francesc's original question is still open. Why does
>>> performance degrade when using a larger chunk cache?
>>>
>>> Cheers,
>>> Ger
>>>
>>>
>>>>>> On 1/12/2010 at 3:27 PM, in message
>>>>>>
>>> <F2ABECC0-20F4-4A1E-B178-BDE7BAEE7982 at hdfgroup.org>, Quincey Koziol
>>> <koziol at hdfgroup.org> wrote:
>>>
>>>> Hi Francesc,
>>>>
>>>> On Jan 8, 2010, at 11:45 AM, Francesc Alted wrote:
>>>>
>>>>
>>>>> Hi Ger,
>>>>>
>>>>> A Friday 08 January 2010 08:25:09 escriguéreu:
>>>>>
>>>>>> Hi Francesc,
>>>>>>
>>>>>> This might be related to a problem I reported last June.
>>>>>> I did tests using a 3-dim array with various chunk shapes and
>>>>>>
>>> access
>>>
>>>>>> patterns. It got very slow when iterating through the data by
>>>>>>
>>> vector in
>>>
>>>>>> the Z-direction. I believe it was filed as a bug by the HDF5
> group.
>>>>>>
>>> I
>>>
>>>>>> sent a test program to Quincey that shows the behaviour. I'll
>>>>>>
>>> forward
>>>
>>>>>> that mail and the test program to you, so you can try it out
>>>>>>
>>> yourself if
>>>
>>>>>> you like to.
>>>>>>
>>>>>> I suspect the cache lookup algorithm to be the culprit. The
> larger
>>>>>>
>>> the
>>>
>>>>>> cache and the more often it has to look up, the slower things
> get.
>>>>>>
>>> BTW,
>>>
>>>>>> Did you adapt the cache's hash size to the number of slots in
> the
>>>>>>
>>> cache?
>>>
>>>>> Thanks for your suggestion. I've been looking at your problem,
> and
>>>>>
>>> my
>>>
>>>>> profiles seem to say that it is not a cache issue.
>>>>>
>>>>> Have a look at the attached screenshots showing profiles for your
>>>>>
>>> test bed
>>>
>>>>> reading in the x axis with a cache size of 4 KB (the HDF5 cache
>>>>>
>>> subsystem
>>>
>>>> does
>>>>
>>>>> not enters in action at all) and 256 KB (your size). I've also
>>>>>
>>> added a
>>>
>>>>> profile
> for the tiled case for comparison purposes. For all the
>>>>>
>>> profiles
>>>
>>>>> (except tiled), the bottleneck is clearly in the `H5FL_reg_free`
> and
>>>>>
>>>
>>>>> `H5FL_reg_malloc` calls, no matter how large the cache size is
> (even
>>>>>
>>> if it
>>>
>>>>> does not enters in action).
>>>>>
>>>>> I think this is expected, because HDF5 has to reserve space for
> each
>>>>>
>>> I/O
>>>
>>>>> operation. When you walk the dataset following directions x or
> y,
>>>>>
>>> you have
>>>
>>>> to
>>>>
>>>>> do (32*2)x more I/O operations than for the tiled case, and HDF5
>>>>>
>>> needs to
>>>
>>>> book
>>>>
>>>>> (and free again!) (32*2)x more memory areas. Also, when you read
>>>>>
>>> through
>>>
>>>> the
>>>>
>>>>> z axis, the additional times to book/release memory is (32*32)x.
>>>>>
>>> All of
>>>
>>>> this
>>>>
>>>>> is consistent with both profiles and running the benchmark
>>>>>
>>> manually:
>>>
>>>>> faltet at antec:/tmp> time ./tHDF5 1024 1024 10 32 32 2 t
>>>>> real 0m0.057s
>>>>> user 0m0.048s
>>>>> sys 0m0.004s
>>>>>
>>>>> faltet at antec:/tmp> time ./tHDF5 1024 1024 10 32 32 2 x
>>>>> setting cache to 32 chunks (4096 bytes) with 3203 slots //
> forcing
>>>>>
>>> no cache
>>>
>>>>> real 0m1.055s
>>>>> user 0m0.860s
>>>>> sys 0m0.168s
>>>>>
>>>>> faltet at antec:/tmp> time ./tHDF5 1024 1024 10 32 32 2 y
>>>>> setting cache to 32 chunks (262144 bytes) with 3203 slots
>>>>> real 0m1.211s
>>>>> user 0m1.176s
>>>>> sys 0m0.028s
>>>>>
>>>>> faltet at antec:/tmp> time ./tHDF5 1024 1024 10 32 32 2 z
>>>>> setting cache to 5 chunks (40960 bytes) with 503 slots
>>>>> real 0m14.813s
>>>>> user 0m14.777s
>>>>> sys 0m0.024s
>>>>>
>>>>> So, in my opinion, there is little that HDF5 can do here. You
>>>>>
>>> should better
>>>
>>>>
>>>>> adapt the chunk shape to your most used case (if you have just
> one,
>>>>>
>>> but I
>>>
>>>> know
>>>>
>>>>>
>>> that this is not typically the case).
>>>
>>>> Interesting results. I'll try to comment on a number of
>>>>
>>> fronts:
>>>
>>>> - The H5FL_* routines are an internal "free list" that
>>>>
>>> the HDF5 library uses,
>>>
>>>> in order to avoid calling malloc() as frequently. You can disable
>>>>
>>> them for
>>>
>>>> your benchmarking by configuring the library with the
>>>> "--enable-using-memchecker".
>>>>
>>>> - It looks like you are using the hyperslab routines
>>>>
>>> quite a bit, are you
>>>
>>>> creating complicated hyperslabs?
>>>>
>>>> - The B-tree comparison routine is being used to locate
>>>>
>>> the correct chunk to
>>>
>>>> perform I/O on, which may or may not be in the cache. Do your
>>>>
>>> datasets have
>>>
>>>> fixed or unlimited dimensions? If they are fixed (or only have 1
>>>>
>>> unlimited
>>>
>>>> dimension), I can point you at an experimental branch with the new
>>>>
>>> chunk
>>>
>>>> indexing features and we can see if that would help your
>>>>
>>> performance.
>>>
>>>> Quincey
>>>>
>>>>
>>>>>> In your tests you only mention the chunk size, but not the chunk
>>>>>>
>>> shape.
>>>
>>>>>> Isn't that important? It gives me the impression that in your
> tests
>>>>>>
>>> the
>>>
>>>>>> data are stored and accessed fully sequentially which makes the
>>>>>>
>>> cache
>>>
>>>>>> useless.
>>>>>>
>>>>> Yes, chunk shape is important, sorry, I forgot this important
>>>>>
>>> detail. As I
>>>
>>>>> mentioned in a previous message to Rob Latham, I want to optimize
>>>>>
>>> 'semi-
>>>
>>>>> random' access mode in a certain row of the dataset, so I
> normally
>>>>>
>>> choose
>>>
>>>> the
>>>>
>>>>> chunk shape as (1, X), where X is the needed value for obtaining
> a
>>>>>
>>> chunksize
>>>
>>>>
>>>>> between 1 KB and 8 MB --if
> X is larger than the maximum number of
>>>>>
>>> columns, I
>>>
>>>>> expand the number of rows in the chunk shape accordingly.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> --
>>>>> Francesc Alted
>>>>>
>>>>>
>>>>
>>>
>>
> <tHDF5-tile.png><tHDF5-x-4KB.png><tHDF5-x-256KB.png>_______________________________
>> _____
>>>
>>>> ___________
>>>>
>>>>> Hdf-forum is for HDF software users discussion.
>>>>> Hdf-forum at hdfgroup.org
>>>>> http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
>>>>>
>>>>
>>>> _______________________________________________
>>>> Hdf-forum is for HDF software users discussion.
>>>> Hdf-forum at hdfgroup.org
>>>> http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
>>>>
>>>
>>> _______________________________________________
>>> Hdf-forum is for HDF software users discussion.
>>> Hdf-forum at hdfgroup.org
>>> http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
>>>
>>
>>
>> _______________________________________________
>> Hdf-forum is for HDF software users discussion.
>> Hdf-forum at hdfgroup.org
>> http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
>
> <tHDF5.cc>_______________________________________________
> Hdf-forum is for HDF software users discussion.
> Hdf-forum at hdfgroup.org
> http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
More information about the Hdf-forum
mailing list