[hdf-forum] provenance

Werner Benger werner at cct.lsu.edu
Wed Mar 25 11:50:04 EDT 2009


Hi Quincey & Matthew,

>
>> 3) to track the provenance of an HDF file might be accomplished by
>> logging a UID (eg time)  when an HDF file is opened.
>> then the HDF file has collection of open times, which are unique to
>> that file.
>> If any write activity occurs after opening, then the open UID is
>> flagged as such.
>> when a file is copied, then the files diverge and are identified by
>> different open UIDs.
>>
>>
>> such a provenance scheme should be automatic and optional.
>> some instances you don't want the overhead, such as you might be
>> doing a million opens.
>> In such a case you get one UID when the HDF was created.
>
> 	Good ideas toward provenance features, yes.
>
I'm wondering how "doable" it is to implement something like a list
of UUIDs instead of a single one. Maybe adding many is no more effort
than adding a single one. Especially for copied files, I'm thinking of
something like a hierarchical scheme, where there is a unique UUID
per file, but there are also "traces" of the UUIDs of the previous files
from which this one has been copied. Of course, such a copy operation would
need to be done by some HDF5 API call; a Unix copy would not do that
(a Unix copy would break the uniqueness of HDF5 file UUIDs anyway...).

I could imagine a creation property for some H5Fcreate() call, where you
add the IDs of the HDF5 source files, and all their UUIDs are included
in the newly created HDF5 file as "children" of the newborn unique one.
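
To make the hierarchical idea a bit more concrete, here is a rough sketch
in plain Python. It is only an illustration of the data structure, not a
proposal for the actual HDF5 API: the names FileLineage, create_from() and
ancestors() are hypothetical, and nothing here touches a real HDF5 file.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field


@dataclass
class FileLineage:
    """Hypothetical provenance record; not part of the HDF5 API."""

    # Every file gets its own freshly generated (random) UUID.
    file_id: uuid.UUID = field(default_factory=uuid.uuid4)
    # Records of the source files this one was created from.
    parents: list[FileLineage] = field(default_factory=list)

    def ancestors(self) -> set[uuid.UUID]:
        """Return the UUIDs of every file this one was derived from."""
        found: set[uuid.UUID] = set()
        for parent in self.parents:
            found.add(parent.file_id)
            found |= parent.ancestors()
        return found


def create_from(*sources: FileLineage) -> FileLineage:
    """Stand-in for a file-creation call that is told its source files."""
    return FileLineage(parents=list(sources))


original = create_from()           # a brand-new file, no sources
copy_a = create_from(original)     # two independent copies diverge ...
copy_b = create_from(original)
merged = create_from(copy_a, copy_b)  # ... and may be combined again
```

Each copy gets its own UUID, so the two copies are distinguishable, while
merged.ancestors() still contains the UUID of the original file.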

Maybe it's too much effort, maybe it's easy...?

>>> 	We could allow an application to choose which type of UUID to
>>> store.  I've filed a bug for adding a UUID to a file and will amend
>>> it to suggest giving the application the choice of which version of
>>> the UUID to store.
>>
>> sounds good, would like to have one to choose from that is not
>> opaque, should include time, computer, username.
>> audit trails are key to provenance.
>
> 	I think we are working to different purposes here.  I'm just trying
> to get a unique ID into the file (and perhaps for each object) and
> don't want to tie it into any provenance effort.  I also want to
> pursue the provenance idea, but it should be a separate, probably
> higher-level, project (which might use the UUID for some purpose).
>
Anonymization of data is as important as keeping personalized information
for as long as possible, especially for medical data. I'd think an anonymous
UUID can always be stored; a personal one that allows tracing
the source and generator (personal computer, IP/MAC address, exact time
of creation) might then be an optional addition. During an anonymization
process, the personalized UUID could be removed from the file (e.g. during
a copy process) and stored in a high-security external database. Just some
ideas...
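
As a minimal sketch of what I mean, in plain Python: a version 4 UUID is
random and reveals nothing, while a version 1 UUID is built from a
timestamp and a node ID (normally the MAC address). The functions
new_file_ids() and anonymize() and the "vault" dictionary are hypothetical
stand-ins, the latter for the external database; keeping the anonymous
UUID unchanged across anonymization is also just an assumption.

```python
import uuid


def new_file_ids() -> dict:
    """Generate both kinds of ID for a newly created file (hypothetical)."""
    return {
        # Version 4: purely random, carries no personal information.
        "anonymous": uuid.uuid4(),
        # Version 1: encodes creation time and a node ID. Python falls
        # back to a random node ID if no MAC address can be determined.
        "personal": uuid.uuid1(),
    }


def anonymize(ids: dict, vault: dict) -> dict:
    """Return the IDs for an anonymized copy, without the personal UUID.

    The personal UUID is moved to the vault, keyed by the anonymous UUID,
    so that an authorized party can still trace the origin later.
    """
    vault[ids["anonymous"]] = ids["personal"]
    return {"anonymous": ids["anonymous"]}


vault = {}                      # stand-in for the secure external database
ids = new_file_ids()
public_ids = anonymize(ids, vault)
```

The anonymized copy carries only the random UUID; the traceable one can be
recovered only by someone with access to the vault.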

	Werner


-- 
___________________________________________________________________________
Dr. Werner Benger <werner at cct.lsu.edu>               Visualization Research
Laboratory for Creative Arts and Technology (LCAT)
Center for Computation & Technology at Louisiana State University (CCT/LSU)
239 Johnston Hall, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809                        Fax.: +1 225 578-5362



