[hdf-forum] data layout for parallel read-only access
Mark Howison
mark.howison at gmail.com
Thu Jun 25 10:28:29 EDT 2009
Hi Mark,
Can you give us some idea of the size of these files, the HDF5 VFD you
are using, the dimensionality of the datasets, and the storage method
(contig vs. chunk)? Also, I assume the slowdown you are reporting here
is not on Lustre, but on whatever filesystem you are using on OS X,
probably HFS+?
My guess is that mmap is performing large, contiguous IO operations,
while for some reason HDF5 isn't (possibly why you are hearing disk
thrashing).
Mark
On Thu, Jun 25, 2009 at 7:20 AM, Mark Moll <mmoll at cs.rice.edu> wrote:
> [Just to summarize the discussion below: I have an MPI-based program that
> needs parallel, asynchronous, read-only access to different data sets. No
> two processes need access to the same data sets. In an old implementation
> without HDF5 this is done by dividing the data sets into a number of files
> that is a multiple of the number of processors and using mmap to access the
> data in each file. This is rather inflexible, so I started to look at HDF5.
> With HDF5 the program runs much slower.]
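
For readers who want a concrete picture of the old file-per-process scheme summarized above, here is a minimal sketch using only the Python standard library. It is not the original program: the round-robin split, the file names, the layout of raw native-endian doubles, and the helper names `files_for_rank`, `read_doubles`, and `demo_mmap` are all assumptions made for illustration.

```python
import mmap
import os
import struct
import tempfile


def files_for_rank(num_files, rank, num_procs):
    """Indices of the files owned by `rank` under a round-robin split."""
    return list(range(rank, num_files, num_procs))


def read_doubles(path):
    """Map a file read-only and decode it as native-endian doubles."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            count = len(m) // 8
            return list(struct.unpack(f"{count}d", m[:count * 8]))


def demo_mmap():
    """Write 8 small files, then read the ones rank 1 of 4 would own."""
    out = {}
    with tempfile.TemporaryDirectory() as d:
        for i in range(8):
            with open(os.path.join(d, f"part{i}.bin"), "wb") as f:
                f.write(struct.pack("3d", float(i), i + 0.5, i + 1.0))
        for i in files_for_rank(8, rank=1, num_procs=4):
            out[i] = read_doubles(os.path.join(d, f"part{i}.bin"))
    return out
```

Each process can compute its own file list from its rank alone, so no communication is needed to decide who reads what.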
>
> Initially, I thought the HDF5 version was much slower because all the data
> sets were stored in one file, but now I see the same thing happening when I
> put every data set into a separate file (# data sets >> # processors). I am
> trying to figure out what's going on with Shark and ThreadViewer on OS X.
> The profiles of the two versions of the program (with and without HDF5) look
> similar, except for some system calls in the HDF5 version. I actually hear
> the program hammering the disk in the HDF5 version, so I thought I'd use
> ThreadViewer to see whether the program was blocked. The picture below shows
> the difference between a thread in the HDF5 version (top) and the mmap-based
> version (bottom). Light green means the thread is running, dark means
> uninterruptible (usually system call), and yellow means recently running.
> Clearly, there's a huge difference, but I don't know how to dig any deeper
> to diagnose this issue.
>
>
>
>
>
> I use OpenMPI 1.3.2, HDF5 1.8.3, and gcc 4.3.3 on OS X 10.5.7 (although I
> have observed the same behavior on Ubuntu Linux and RHEL with Intel and PGI
> compilers). I have also run the program through valgrind (on Linux and OS X)
> to see if that would turn up any suspicious activity, but to no avail.
>
> On Jun 23, 2009, at 9:24 AM, Quincey Koziol wrote:
>
>> Hi Mark,
>>
>> On Jun 23, 2009, at 9:02 AM, Mark Moll wrote:
>>
>>> On Jun 23, 2009, at 8:53 AM, Quincey Koziol wrote:
>>>>
>>>> On Jun 23, 2009, at 12:34 AM, Mark Howison wrote:
>>>>>
>>>>> On Fri, Jun 19, 2009 at 9:47 AM, Mark Moll <mmoll at cs.rice.edu> wrote:
>>>>>>
>>>>>> Just a few clarifications:
>>>>>> - The reading of data sets is mostly asynchronous; each node reads its
>>>>>> "own" data sets.
>>>>>
>>>>> Hmm, so each dataset belongs to only one node? It sounds like you
>>>>> might not want to go the parallel HDF5 route. If the access is
>>>>> asynchronous then you don't want to be using synchronized collective
>>>>> calls.
>>>>
>>>> Yes, this was my thought also. If the file is read-only, can
>>>> each process open it independently?
>>>
>>> Right, this is what I did. However, multiple processes trying to read the
>>> same file seems to introduce significantly more system overhead than
>>> multiple processes reading different files.
>>
>> Are you opening the file with MPI or with the default HDF5 file
>> driver?
>>
>> It does sound more like a file system issue than something with
>> HDF5...
>>
>> Quincey
>>
>>>>>> - File-per-node is indeed a problem with varying concurrency. That's
>>>>>> why I was thinking of a data-set-per-file organization. Each file
>>>>>> needs to be accessed by only one node. Each node would have to read
>>>>>> many files.
>>>>>
>>>>> The disadvantage to dataset-per-file is that it could lead to lots of
>>>>> files. We have had problems on Lustre with using basic filesystem
>>>>> commands (ls, mv, cp, etc.) on large collections of small files (in a
>>>>> recent example, 400K files totaling 230TB). Those commands will fail
>>>>> with an error like "argument list too long...". This may be a
>>>>> limitation of those commands, though, rather than something specific
>>>>> to Lustre.
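
As a side note on the "argument list too long" error: it is normally raised when a shell glob expands to more arguments than the operating system accepts for a single exec call, which supports the guess that it is not specific to Lustre. One way around it is to walk the directory from inside a program and handle files in bounded batches. The sketch below assumes a flat directory; the helper names `iter_files`, `batches`, and `demo_batches` are hypothetical.

```python
import os
import tempfile


def iter_files(directory):
    """Yield file paths one at a time without building an argument list."""
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                yield entry.path


def batches(paths, size):
    """Group paths into lists of at most `size` items."""
    batch = []
    for p in paths:
        batch.append(p)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def demo_batches(num_files=25, size=10):
    """Create empty files in a temporary directory and return the batch sizes."""
    with tempfile.TemporaryDirectory() as d:
        for i in range(num_files):
            open(os.path.join(d, f"f{i}.dat"), "w").close()
        return [len(b) for b in batches(iter_files(d), size)]
```

Each batch could then be passed to whatever per-file operation is needed, so no single command line ever carries the full file list.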
>>>
>>> I'm now experimenting with the data-set-per-file approach. I have an
>>> index file that contains (among other things) the file location of each data
>>> set. The data sets are stored in an automatically created directory
>>> hierarchy, so that no directory will have too many files.
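
The message does not say how the directory hierarchy is generated. One common approach, shown here purely as an assumption, is to derive nested directory names from a hash of the data set name so that files spread evenly across directories. The `.h5` extension, the two-level layout, and the helpers `shard_path` and `build_index` are hypothetical.

```python
import hashlib
import os


def shard_path(root, name, levels=2, width=2):
    """Place `name` under `levels` nested directories taken from its hash."""
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, name + ".h5")


def build_index(root, names):
    """Map each data set name to its file location, as an index file might."""
    return {name: shard_path(root, name) for name in names}
```

With two hexadecimal characters per level, each directory holds at most 256 subdirectories, and the location of any data set can be recomputed from its name even if the index is lost.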
>>>
>>>
>>>>>> So the question is whether there is a performance difference between
>>>>>> parallel asynchronous reading of many files vs. one large file. I'm
>>>>>> guessing HDF5 does some preemptive fetching, which would work in favor
>>>>>> of one large file.
>>>>>
>>>>> I don't think that HDF5 will do any speculative read-aheads within a
>>>>> file, if that is what you mean by preemptive fetching. It will only
>>>>> read the regions you specify with a hyperslab selection.
>>>>>
>>>>> Collective/parallel access only makes sense if you have many nodes
>>>>> selecting different hyperslabs within the same dataset. If each node
>>>>> is loading a different dataset, I think that collective access will
>>>>> lead to unnecessary MPI communication, which will only become worse as
>>>>> you scale up.
>>>
>>>
>>> Thanks for clarifying. That was my impression, but I wanted to make sure
>>> before giving up on that route.
>>>
>>> --
>>> Mark
>>>
>>>
>>>
>>>
>>
>
> --
> Mark
>
>
>
>
>
> ----------------------------------------------------------------------
> This mailing list is for HDF software users discussion.
> To subscribe to this list, send a message to
> hdf-forum-subscribe at hdfgroup.org.
> To unsubscribe, send a message to hdf-forum-unsubscribe at hdfgroup.org.
>