[hdf-forum] modelling tick data

Tito Ingargiola tito at puppetmastertrading.com
Sun Jan 4 13:53:45 EST 2009


Hi,

Francesc & Ray, I wanted to thank you for your detailed and
excellent insights into using hdf5 for my application.  

> [...] it would be really interesting to know which approach (OTPI 
> vs OBT) finally works best for you.

I've spent some time over the holidays looking into both approaches and
have blogged some thoughts on it here:

http://www.puppetmastertrading.com/blog/2009/01/04/managing-tick-data-with-hdf5/

and will follow up with some more detailed results within the coming week.

Many thanks for your help!

     Tito.




________________________________
From: Francesc Alted <faltet at pytables.com>
To: hdf-forum at hdfgroup.org
Sent: Wednesday, December 24, 2008 1:54:24 PM
Subject: Re: [hdf-forum] modelling tick data

On Wednesday 24 December 2008, Tito Ingargiola wrote:
> Hi Ray & Francesc,
>
> Thank you both very much for your informative responses!  Although
> each of you champions a different approach, both provide helpful
> ideas - thank you.  (Indeed, the fact that both approaches seem
> workable is informative on its own!)  I also agree that
> experimentation here is king, but I want to balance reasonable
> experimentation with not reinventing the wheel when so many are
> already zipping along...
>
> Francesc, I had already read these excellent docs you have written
> and am very impressed by the work you've done with pytables.  It
> sounds like you're leery of the overhead of an OTPI approach and I
> can see why.  A couple of further questions for you about indexing into
> OBT - while my data layout from a column/field perspective looks something like:
> { contractID, dateTime, ... } with an OBT approach I think I would
> have to additionally add a field for performing a binary search
> within a contract: { index, contractID, dateTime,... } and would
> additionally need to have an external table indexing into OBT,
> identifying where each new contract begins (and perhaps more).  Does
> this sound right to you?  Do you have any suggestions on how to
> "point into" the OBT?

Yes, that's a perfectly good way to 'index' data for the OBT approach, 
IMO.  In order to 'point to' the OBT you may want to use either just 
start and end rows or HDF5 references to table hyperslabs (ranges of 
rows in this case), whichever approach you find more comfortable for your 
needs.
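
As a rough illustration of the "start and end rows" variant, here is a
minimal Python sketch. It is not HDF5 code: a plain list stands in for the
OBT, and the names OBT, INDEX, first_row_at_or_after and ticks, as well as
the price column and the integer timestamps, are assumptions made up for
the example. The external index maps each contract to its row range, and a
binary search on dateTime runs only inside that range.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, int, float]  # (contractID, dateTime, price); price is illustrative

# Stand-in for the one big table, sorted by contractID and then dateTime.
OBT: List[Row] = [
    ("IBM", 100, 84.1),
    ("IBM", 105, 84.2),
    ("IBM", 110, 84.0),
    ("MSFT", 101, 19.5),
    ("MSFT", 104, 19.6),
]

# External index: contractID -> (start row, end row), end exclusive.
INDEX: Dict[str, Tuple[int, int]] = {"IBM": (0, 3), "MSFT": (3, 5)}


def first_row_at_or_after(table, lo, hi, t):
    """Binary search rows [lo, hi) for the first row with dateTime >= t."""
    while lo < hi:
        mid = (lo + hi) // 2
        if table[mid][1] < t:
            lo = mid + 1
        else:
            hi = mid
    return lo


def ticks(table, index, contract, t1, t2):
    """Return the rows of one contract with t1 <= dateTime <= t2."""
    start, end = index[contract]
    first = first_row_at_or_after(table, start, end, t1)
    # Integer timestamps are assumed, so t2 + 1 gives an inclusive upper bound.
    last = first_row_at_or_after(table, start, end, t2 + 1)
    return table[first:last]
```

With a real file, the list slice would become a read of rows first..last
from the table, and the index itself could live in a small second table.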

Of course, you will need to re-compute your indexes after every table 
merge.  Having an integrated indexing engine would make things a lot 
easier, but the above approach is doable anyway.
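
Re-computing such an index can be a single pass over the merged table. The
sketch below assumes the table is already sorted by contractID and that the
contractID sits in position 0 of each row; build_index is a hypothetical
helper name, and a Python list again stands in for the HDF5 table.

```python
def build_index(table):
    """Scan a table sorted by contractID; return contractID -> (start, end).

    The end row is exclusive, so the rows of a contract are table[start:end].
    """
    index = {}
    start = 0
    for row_number, row in enumerate(table):
        # A change of contractID closes the range of the previous contract.
        if row_number > 0 and row[0] != table[row_number - 1][0]:
            index[table[start][0]] = (start, row_number)
            start = row_number
    if table:
        # Close the range of the last contract in the table.
        index[table[start][0]] = (start, len(table))
    return index
```

Because the scan is linear and needs no extra memory beyond the index, it
should be cheap compared with the merge that precedes it.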

At any rate, it would be really interesting to know which approach (OTPI 
vs OBT) finally works best for you.

Cheers,

Francesc

>
> Many thanks for your help and best wishes for a Merry Xmas and Happy
> Holidays,
>
>     Tito.
>
>
>
>
>
> ________________________________
> From: Francesc Alted <faltet at pytables.com>
> To: hdf-forum at hdfgroup.org
> Sent: Wednesday, December 24, 2008 7:58:33 AM
> Subject: Re: [hdf-forum] modelling tick data
>
> Hi Tito,
>
> On Tuesday 23 December 2008, Tito Ingargiola wrote:
> > Hi,
> >
> > I'm trying to figure out how to best use hdf5 for my data.  I've
> > been experimenting with various options but there seem to be many,
> > many different ways to model things and no relevant examples that I
> > have come across.
> >
> > Below I describe the data and its primary use as well as some
> > questions about how I might most effectively model it within hdf5.
> > I'm using the C interface and would like to use the HL interfaces
> > as much as possible.  Ultimately, I will
> > also need to access this data via Java in some cases and believe
> > that my best bet is to write the storage and query code in C and
> > then use SWIG/JNI to access this via Java.  (This is based on
> > prototyping I've done and my assessment of the current Java hdf5
> > interface.)   Thus, using pytables doesn't seem applicable for my
> > circumstance.
>
> Don't discard PyTables so soon ;-)  You could use pydap [1] for
> serving PyTables files through the Data Access Protocol [2] and then
> using one of the DAP adapters [3] for your preferred language on the
> client side.
>
> [1] http://pydap.org/
> [2] http://opendap.org/
> [3] http://opendap.org/download/index.html
>
> To adapt pydap better to your needs, you could even modify the
> PyTables plugin for pydap (it is very easy to understand) and
> tailor it accordingly.
>
> Also, in addition to the (excellent) advice that Ray has already
> given you, look at my comments interspersed in your message.
>
> > I'll appreciate any responses, insights or pointers you might
> > provide.  Thanks and best wishes for the holidays,
> >
> >      Tito.
> >
> > --
> >
> > A description of the data and its use
> >
> > The data is all timestamped financial streams of "tick" data.  Each
> > record is small (a few hundred bytes at the most), but there are
> > many - in a day you may see many hundreds of millions to a few billion.
> > Each record is naturally partitioned by instrument (e.g.,
> > "microsoft", "ibm", "dec crude", etc.).  There are fewer than 30K
> > instruments in the universe I might care about.
> >
> > I (more or less) don't care how long it takes to construct the h5
> > files/structures as it will be performed offline and the only
> > critical query I care about is something like:
> >
> >
> > "Get ticks for instruments {i1...in} from time t1 to time t2
> > ordered by time, instr".
> >
> > That is, I need to be able to "replay" a subset of the instruments
> > within the data store over  some period of time.  But I really care
> > that this be as fast as possible.
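
To make the shape of this query concrete, here is a minimal Python sketch
for the one-table-per-instrument layout. It is only an illustration under
stated assumptions: TABLES is a made-up in-memory stand-in for the stored
tables, replay is a hypothetical name, and the price column and integer
timestamps are invented for the example. Each instrument's rows are cut to
the time window by binary search and the per-instrument streams are then
merged with the standard library's heapq.merge.

```python
import heapq
from bisect import bisect_left, bisect_right

# Hypothetical per-instrument tables, each sorted by time:
# instrument -> [(time, price), ...]
TABLES = {
    "IBM": [(100, 84.1), (105, 84.2), (110, 84.0)],
    "MSFT": [(101, 19.5), (105, 19.6)],
    "DEC_CRUDE": [(102, 40.3)],
}


def replay(tables, instruments, t1, t2):
    """Yield (time, instr, price) with t1 <= time <= t2, ordered by time, instr."""
    streams = []
    for instr in instruments:
        rows = tables[instr]
        # For illustration only: a real store would search the time column
        # in place instead of copying it into a list first.
        times = [row[0] for row in rows]
        lo, hi = bisect_left(times, t1), bisect_right(times, t2)
        streams.append([(t, instr, price) for t, price in rows[lo:hi]])
    # Tuples compare by time first and instrument second, which gives the
    # "ordered by time, instr" requirement without a custom key.
    return heapq.merge(*streams)
```

For example, list(replay(TABLES, ["MSFT", "IBM"], 101, 105)) walks the two
instruments in time order and breaks the tie at time 105 by instrument name.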
> >
> > Questions
> >
> > 0.  Am I barking up the wrong tree?  Is HDF5 an appropriate
> > technology for the use I've described?
>
> In my experience, HDF5 is perfectly appropriate to cope with this. 
> It is, like many other things, just a matter of properly directing it
> to do what you want to do ;-)  In particular, your volume of data
> looks like it is going to be very large, so you should be very careful
> when choosing the different parameters for your application. 
> Remember that experimenting is your best friend before putting code
> into production.
>
> > 1. Given the size/volume of the data, my thought is to partition h5
> > files by day.  Uncompressed, the files will be on the order of
> > ~25G. Does this sound reasonable?  What are the key factors
> > impacting this decision from an hdf5 perspective?
>
> 25GB is a completely reasonable figure for a single file. Moreover,
> by using compression, you can reduce this even more, so I see no
> problem with this.  However, how data is organised inside the file
> becomes very important for handling data efficiently (see below).
>
> > Two alternative models come immediately to mind: one big table
> > (OBT) per day ordered by instrument and then time, or one table per
> > instrument (OTPI) ordered by time.  My current inclination is OTPI
> > as it seems more manageable assuming the overhead of so many tables
> > isn't an issue.
> >
> > 2a.  Are there other, better models you suggest I investigate?
> >
> > 2b.  With the OBT, I'd need to be able to "index into" the table to
> > identify the beginning of each instrument's section (at least). 
> > How would you recommend doing this?  It seems possible to do this
> > with references or perhaps a separate table with numerical indices
> > into the main table.  Any pros/cons/alternatives to these
> > approaches?
> >
> > 2c.  With the OTPI, I'd need to have many tables (at most ~30K) per
> > file.  Is this an issue?
> >
> > 2d. For both models, I'd need to be able to merge sorted sets of h5
> > data into one sorted set as quickly as possible.  Is there any hdf5
> > support for doing such a thing or external libraries created for
> > this purpose?
>
> Indeed, both OTPI and OBT have their advantages and drawbacks. 
> OTPI, for me, has the disadvantage of requiring a large number
> of tables for your case (having 30000 datasets in a single file is
> nothing to sneeze at), as well as requiring a somewhat more
> complicated query code.  The OBT one is probably a simpler and better
> approach in that HDF5 has to deal with less metadata (the 30000
> datasets are avoided), so it is probably faster, but you may need
> additional logic in your programs to perform fast queries (indexing
> will help a lot indeed). I'd recommend doing your own experiments
> here.
>
> Regarding merging sorted datasets, you can always implement a typical
> merge sort for your needs.  Otherwise, you may want to use the
> sorting capabilities of PyTables Pro that can handle arbitrarily
> large tables.
>
> > 3. What impact on retrieval/querying should I expect to see with
> > varying levels of compression?
>
> See:
>
> http://www.pytables.org/docs/manual/ch05.html#searchOptim
>
> for some experiments that I've recently done on this.  They are meant
> for PyTables, but you may find useful hints for your case too.
>
> > 4. Any suggestions on chunksizes for this application?
>
> See:
>
> http://www.pytables.org/docs/manual/ch05.html#chunksizeFineTune
>
> for more experiments in that regard.
>
> > Many thanks for any insights you might provide!
>
> Hope that helps, and Merry Christmas!



-- 
Francesc Alted
