
high frequency data analysis in R

24 messages · Michael, Liviu Andronic, Hae Kyung Im +7 more

#
Hi all,

I am wondering if there are some special toolboxes to handle high
frequency data in R?

I have some high frequency data and was wondering what meaningful
experiments I can run on them.

I am not sure whether the usual (low frequency) financial time series
textbook data analysis tools will work on high frequency data.

Let's say I run a correlation between two stocks using the high
frequency data, or run an ARMA model on one stock, will the results be
meaningful?

Could anybody point me to a classroom-style treatment or a
lab-tutorial-type document showing what meaningful experiments/tests
I can run on high frequency data?

Thanks a lot!
#
Not my domain, but you will more than likely have to aggregate to some
sort of regular/homogeneous series for most traditional tools to work.

xts has to.period to aggregate up to a lower frequency from tick-level
data. Coupled with something like na.locf, you can make yourself some
high frequency 'regular' data from 'irregular' data.

Regular and irregular of course depend on what you are looking at
(weekends missing in daily data can still be 'regular').
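As a sketch, the to.period/na.locf route might look like this (illustrative only; the simulated series and names are made up, and it assumes the xts package is installed):

```r
library(xts)  # assumes the xts package (and its zoo dependency) is installed

set.seed(42)
# simulated irregular tick data: 500 trades scattered over one hour
stamps <- as.POSIXct("2009-05-21 09:30:00", tz = "UTC") +
  sort(runif(500, 0, 3600))
ticks  <- xts(100 + cumsum(rnorm(500, sd = 0.01)), order.by = stamps)

# aggregate up to 1-minute OHLC bars
bars <- to.period(ticks, period = "minutes", k = 1)

# merge onto a regular 1-minute grid and carry the last bar forward
grid    <- seq(start(bars), end(bars), by = "min")
regular <- na.locf(merge(bars, xts(, order.by = grid)))
```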

I'd be interested in hearing thoughts from those who actually tread in
the high-freq domain...

A wealth of information can be found here:

 http://www.olsen.ch/publications/working-papers/

Jeff
On Thu, May 21, 2009 at 10:04 AM, Michael <comtech.usa at gmail.com> wrote:

#
Thanks Jeff.

By high frequency I mean really the tick data. For example, during
peak time, the arrival of price events could be at about hundreds to
thousands within one second, irregularly spaced.

I've heard that forcing irregularly spaced data into regularly spaced
data (e.g. through interpolation) will lose information. How is that so?

Thanks!
On Thu, May 21, 2009 at 8:15 AM, Jeff Ryan <jeff.a.ryan at gmail.com> wrote:
#
My data are price change arrivals, irregularly spaced. But when there
is no price change, the price stays constant. Therefore, in fact, at
any time instant, you give me a time, I can give you the price at that
very instant of time. So irregularly spaced data can be easily sampled
to be regularly spaced data.
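Under this view the price path is a step function, so regular sampling is just evaluating that step function on a grid. In base R, approx() with method = "constant" does exactly this (a minimal sketch with made-up numbers):

```r
# irregular price-change arrivals: event times (in seconds) and new prices
t_irr <- c(0.0, 0.4, 1.7, 1.9, 3.2, 5.5)
p_irr <- c(100.00, 100.01, 100.00, 99.99, 100.02, 100.01)

# sample on a regular 1-second grid: the price at time t is the last
# price set at or before t (last observation carried forward)
grid  <- 0:5
p_reg <- approx(t_irr, p_irr, xout = grid, method = "constant", f = 0)$y
p_reg
# -> 100.00 100.01 99.99 99.99 100.02 100.02
```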
What do you think of this approach?
On Thu, May 21, 2009 at 8:21 AM, Michael <comtech.usa at gmail.com> wrote:
#
Hello Michael,
On Thu, May 21, 2009 at 5:21 PM, Michael <comtech.usa at gmail.com> wrote:
If I understand correctly, you're dealing with an issue---that I'm
currently investigating---of nonsynchronous data. You may be
interested in library(realized), which implements at least the
Hayashi-Yoshida covariance estimator (2005). Be sure to check the
package's homepage for an extended user manual and a (possibly
obsolete) table of implemented methods. There is also a paper dealing
with synchronizing data using a "Refresh Time" methodology
("Multivariate realised kernels: consistent positive semi-definite
estimators of the covariation of equity prices with noise and
non-synchronous trading", BARNDORFF-NIELSEN, HANSEN, LUNDE and
SHEPHARD, 2008).
These methods target high-frequency data; unfortunately, I am dealing
with very low-frequency non-synchronous data, and I'm still looking for
a data synchronization method/consistent covariance estimator. If anyone
is familiar with available methodology/R implementations, please share
your thoughts.
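For intuition, the Hayashi-Yoshida estimator itself is simple enough to sketch in base R: it sums cross-products of returns over every pair of overlapping observation intervals, with no synchronization step. (An illustrative sketch only, not the realized package's implementation.)

```r
# Hayashi-Yoshida covariance for two irregularly observed price series.
# t1, t2 : observation times; p1, p2 : (log-)prices observed at those times.
hy_cov <- function(t1, p1, t2, p2) {
  r1 <- diff(p1)
  r2 <- diff(p2)
  total <- 0
  for (i in seq_along(r1)) {
    for (j in seq_along(r2)) {
      # add r1[i] * r2[j] whenever the return intervals
      # (t1[i], t1[i+1]] and (t2[j], t2[j+1]] overlap
      if (t1[i] < t2[j + 1] && t2[j] < t1[i + 1]) {
        total <- total + r1[i] * r2[j]
      }
    }
  }
  total
}

# on a common, synchronous grid it reduces to the realized covariance
hy_cov(0:3, c(0, 1, 1, 2), 0:3, c(0, 2, 3, 3))
# -> 2
```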

Best,
Liviu
#
I think in general you would need some sort of pre-processing before using R.

You can use periodic sampling of prices, but you may be throwing away
a lot of information. This is a method that used to be recommended
more than 5 years ago to mitigate the effect of market noise, at least
in the context of volatility estimation.

Here is my experience with tick data:

I used FX data to calculate estimated daily volatility using TSRV
(Zhang et al 2005
http://galton.uchicago.edu/~mykland/paperlinks/p1394.pdf). Using the
time series of estimated daily volatilities, I forecasted volatilities
for 1 day up to 1 year ahead. The tick data was stored in a
Quantitative Analytics database. I used their C++ API to query daily
data, computed the TSRV estimator in C++, and saved the results to a
text file. Then I used R to read the estimated volatilities and used
FARIMA to forecast
volatility. An interesting thing about this type of series is that the
fractional coefficient is approximately 0.4 in many instances.
Bollerslev has a paper commenting on this fact.
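The two-scale idea behind TSRV is compact enough to sketch in base R: average the realized variance over K sparse subgrids, then subtract a noise-bias correction based on the full-sample realized variance. (An illustrative sketch, not the production C++ code described above.)

```r
# two-scale realized variance (Zhang, Mykland & Ait-Sahalia 2005), sketch
# p : log-prices at tick times; K : spacing of the slow (sparse) scale
tsrv <- function(p, K = 5) {
  n      <- length(p) - 1              # number of fast-scale returns
  rv_all <- sum(diff(p)^2)             # fast-scale RV (biased by noise)
  # realized variance averaged over the K subsampled grids
  rv_sub <- mean(sapply(seq_len(K), function(k) {
    sum(diff(p[seq(k, length(p), by = K)])^2)
  }))
  nbar <- (n - K + 1) / K              # average observations per subgrid
  rv_sub - (nbar / n) * rv_all         # bias-corrected estimate
}
```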

In another project, I had treasury futures market depth data. The data
came in plain text format, with one file per day. Each day had more
than 1 million entries. I don't think I could handle this with R. To
get started I decided to use only actual trades. I used Python to
filter out the trades. So this came down to ~60K entries per day. This
I could handle with R. I used to.period from xts package to aggregate
the data.

In order to handle market depth data, we need some efficient way to
access (query) this huge database. I looked a little bit into kdb but
you have to pay ~$25K to license the software for one processor. I
haven't been able to look into it further for now.

Haky
On Thu, May 21, 2009 at 10:15 AM, Jeff Ryan <jeff.a.ryan at gmail.com> wrote:
#
If there is a way to call R functions from within C++, that should
solve the large-data-set problem, right?
On the other hand, you could just split the data into smaller chunks,
for example using SAS?
On Thu, May 21, 2009 at 9:13 AM, Hae Kyung Im <hakyim at gmail.com> wrote:
#
Could anybody comment on my approach of obtaining regularly spaced
data from irregularly spaced price changes, and then use R to process
them?
On Thu, May 21, 2009 at 8:38 AM, Michael <comtech.usa at gmail.com> wrote:
#
Not to distract from the underlying processing question, but to answer
the 'data' one:

Handling the data in R should not be too much of an issue, at least from a size perspective.

xts objects on the order of millions of observations are still fast
and memory friendly with respect to copying operations internal to
many xts calls (merge, subset, etc.):

   user  system elapsed
  0.037   0.015   0.053


7 million obs of a single column xts is ~54 Mb.  Certainly you can
handle quite a bit of data if you have anything more than trivial
amounts of RAM.
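A quick way to reproduce this kind of measurement (a sketch; sizes and timings will vary by machine, and it assumes xts is installed):

```r
library(xts)  # assumes the xts package is installed

n <- 1e6  # one million observations (the thread discusses seven million)
x <- xts(rnorm(n),
         order.by = as.POSIXct("2009-01-01", tz = "UTC") + seq_len(n))

print(object.size(x), units = "Mb")           # data vector plus POSIXct index
system.time(y <- x["2009-01-05/2009-01-06"])  # time-based subset
```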

quantmod now has (devel) an attachSymbols function that makes
lazy-loading data very easy, so all your data can be stored as xts
objects and read in on-demand.

xts is also getting the ability to query subsets of data on disk, by
time.  This will have no practical limit.

For current data solutions, xts, fts (C++), data.table, and some
others should mitigate your problems, if not solve the 'data' side
altogether.


HTH
Jeff
On Thu, May 21, 2009 at 11:13 AM, Hae Kyung Im <hakyim at gmail.com> wrote:

#
Regarding the approach that turns irregular data into a regular one:
I think it's a complex question, and how you approach it will depend
on the specific problem.

With your method, you would assume that the price is equal to the last
traded price or something like that. If there is no trade for some
time, would it make sense to say that the price is the last traded
price? If you wanted to actually buy/sell at that price, it's not
obvious that you'll be able to do so.

Also, if you only look at the time series of instantaneous prices, you
would be losing a lot of information about what happened in between
the time points. It makes more sense to aggregate and keep, for
example, open, high, low and close. Or some statistics on the
distribution of the prices between the endpoints.

If what you need to calculate is correlations, then I would look at
the papers that Liviu suggested. It seems that synchronicity is
critical. I heard there is an extension of TSRV to correlations.

If you only need to look at univariate time series, you may be able to
get away with your method more easily. It may not be statistically
efficient but may give you a good enough answer in some cases.


HTH
Haky
On Thu, May 21, 2009 at 10:38 AM, Michael <comtech.usa at gmail.com> wrote:
#
Jeff,

This is very impressive. Even on my MacBook Air it takes less than 0.2
seconds total:

   user  system elapsed
  0.093   0.021   0.198
When you say stored, does it mean on disk or memory?
This would be great! Will we be able to append data to xts stored on disk?


Thanks
Haky
On Thu, May 21, 2009 at 11:23 AM, Jeff Ryan <jeff.a.ryan at gmail.com> wrote:
#
Haky,

My times are from a new R session on a MacBook 2.16, so yes it is fast.
On Thu, May 21, 2009 at 12:02 PM, Hae Kyung Im <hakyim at gmail.com> wrote:
attachSymbols can use disk or memory for caching, but the files are
read with getSymbols, so they can realistically be stored anywhere.
The docs provide at least a small introduction.

The tutorial I gave at R/Finance 2009 gives a small example as well.

http://www.RinFinance.com/presentations
The core issue is whether to optimize for reads or writes.  I lean
toward read optimization, so something akin to a column-based
structure.  This will make writes more costly, but that is acceptable
to me at the moment.  Probably I'll keep some sort of write-structure
=> read-structure conversion tool in the mix as well.

I will of course keep the list updated on progress here once it is
ready for release.
Thanks,
Jeff

#
On 21 May 2009 at 11:13, Hae Kyung Im wrote:
| access (query) this huge database. I looked a little bit into kdb but
| you have to pay ~25K to buy the software for one processor. I haven't

True, but you can get a "free" (as in beer) 32-bit version that times
out after two hours. That's not a bad compromise.

I looked at it for a bit, and it has an R interface. (My blog has a
patch to fix their then-broken interface to R's Datetime; I think they
may have integrated that by now.)  Then again, you can also
pre-process into RData files, or use hdf5, or use a gazillion other
methods.  But the free trial version may just help for the odd
research project like the one Haky described.

Dirk
#
High-frequency is not my specialty either, but a quote caught my attention:
On Thu, May 21, 2009 at 11:38 AM, Michael <comtech.usa at gmail.com> wrote:
Outside of the instant a trade took place, you have the NBBO (and
market depth). The last trade's price may or may not be transactable
again on either the long or short side.

You can alternatively say that you have an instantaneous "mid-market
price" and a bid/ask spread to work with.

Correct me if I'm wrong - I'd like to know how people in HF really
look at their data.

-- ET.
#
Is there any literature on the relative performance gain of
preprocessing data into RData and then reading it into R? Does it
break down anywhere? I have 4 GB of data that I'm reading in, and I/O
is a large bottleneck.

Brian


#
I feel like I should change the title again... :)

First off, the RData files are compressed. If you don't want the gzip
overhead, get rid of it.
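For instance (a sketch; the temp files are made up, and the size ratio will depend on how compressible the data is):

```r
# write the same object with and without the default gzip compression
x <- cumsum(rnorm(1e5))

f_gz  <- tempfile(fileext = ".RData")
f_raw <- tempfile(fileext = ".RData")

save(x, file = f_gz)                     # gzip-compressed (the default)
save(x, file = f_raw, compress = FALSE)  # skip the gzip overhead

file.info(f_gz)$size / file.info(f_raw)$size  # size ratio
```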

The xts 'on-disk' format is nothing more than the in-memory structure
written to disk.  This manages to be both faster and smaller.  It
isn't a huge gain, but it allows for binary searching of the index to
get to the data you want.

I will put together a performance comparison at some point, and pass along.

Jeff

On Thu, May 21, 2009 at 12:52 PM, Rowe, Brian Lee Yung (Portfolio
Analytics) <B_Rowe at ml.com> wrote:

#
... and possibly even a list change.

Do you plan on making this compatible with ff or bigmemory? Seems like this theme is making its rounds.

Brian 

#
Some resources:

If you want to deal with irregular data, Eric Zivot's book on
financial time series mentions some operators available in S-Plus,
based on this Zumbach/Muller paper:
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=208278.  These
could be easily adapted to R.  Ruey Tsay's book also has a chapter
that touches on it.  In general terms, Dacorogna et al. is a good
overview.

And as already mentioned, definitely look at the realized package:
http://students.washington.edu/spayseur/realized/.
On Thu, May 21, 2009 at 12:51 PM, Hae Kyung Im <hakyim at gmail.com> wrote:
#
I have given some thought to both ff and bigmemory.  I am not a huge
fan of the "ff" license.

http://cran.r-project.org/web/packages/ff/LICENSE

bigmemory is interesting in that you can bypass the R memory issues on
Windows, but I haven't had incredible luck with it.  Me and C++ don't
like each other, so maybe it is something related to that :).  I can
get around the Windows issues by using something non-windows...

Supposedly the changes to the most recent bigmemory are quite good,
but trying the shared memory route (one of the biggest reasons I would
like to use) has caused me catastrophic failure.

At the end of the day there is no good way to make xts rely on
bigmemory. As so much of the xts code is in C, you can't readily
operate on the external pointers from there.  You need to read in the
data (via `[`), and at that point it is resident in the R process, so
you only get the penalty of the memory allocation and none of the
gain.

Of course this is my 2c.  Maybe we need another time-series library :)

Jeff




On Thu, May 21, 2009 at 1:10 PM, Rowe, Brian Lee Yung (Portfolio
Analytics) <B_Rowe at ml.com> wrote:

#
I failed to point out that data.table can make use of both (?) of those packages.

It isn't a time-series library per se, but it makes for one very cool
in-memory database.  Similar in spirit to some of the not-so-free ones
out there...

Jeff
On Thu, May 21, 2009 at 1:22 PM, Jeff Ryan <jeff.a.ryan at gmail.com> wrote:

#
In fact, I have the whole jump process of the best bid and best ask as
time-stamped arrival data, and also the jump process of the last trade
price, likewise time-stamped. Any more thoughts?
On Thu, May 21, 2009 at 9:51 AM, Hae Kyung Im <hakyim at gmail.com> wrote:
#
Under the hood, fts is a C++ policy-template-based class.

The underlying storage can be pretty much anything as long as you
implement the public API.

here is the backend for R:
http://github.com/armstrtw/r.tslib.backend/tree/master

the backend could just as easily be a python object allocator, a
dataframe, or a flat file.

-Whit
On Thu, May 21, 2009 at 2:22 PM, Jeff Ryan <jeff.a.ryan at gmail.com> wrote:
#
I'm new to R and interested in working with large amounts of data
(time series, but regularly spaced). Can you point me to a good
reference for using data.table with bigmemory or ff?

(I'm a bit puzzled about what exactly these packages provide. As I
understand it, on 32-bit platforms files are subject to the same 2GB limit
as in-process memory, so I assume that dealing with a larger dataset still
requires breaking it up into multiple files...)

Thanks for your help.



#
http://finzi.psych.upenn.edu/R/Rhelp08/2009-March/193490.html

Further information on 'ff' and 'bigmemory' is covered in those
respective packages.

As far as combining the two/three, I would wait to hear back from Matt
on exactly how to do that.  I thought there was an example somewhere
if I recall...

The main advantage to using large datasets in RAM is simply
efficiency.  'ff' makes that process manageable without a lot of RAM,
bigmemory can bypass single process limits of R (and do some cool
memory sharing).  The advantage to both is really confined to 32bit
processing, if I am thinking straight.

This is probably more of a question for R-help at this point, or even
R-Sig-db though, as the 'finance' part is only tangential.

If you can break the data up with something like a db scheme, then xts
will be faster than all (?) the other solutions for in-memory
manipulation -- as it is time-series oriented.  And if you've got
64bits and lots of RAM it should do most of what you need.

HTH
Jeff
On Fri, May 22, 2009 at 9:18 AM, Steve Jaffe <sjaffe at riskspan.com> wrote: