Hi all, I am wondering if there are special toolboxes for handling high frequency data in R? I have some high frequency data and was wondering what meaningful experiments I can run on it. I'm not sure whether the normal (low frequency) financial time series analysis tools from textbooks will work for high frequency data. Let's say I run a correlation between two stocks using the high frequency data, or fit an ARMA model to one stock: will the results be meaningful? Could anybody point me to a classroom-style treatment or lab-tutorial-type document showing what meaningful experiments/tests I can run on high frequency data? Thanks a lot!
high frequency data analysis in R
24 messages · Michael, Liviu Andronic, Hae Kyung Im +7 more
Not my domain, but you will more than likely have to aggregate to some sort of regular/homogeneous series for most traditional tools to work. xts has to.period to aggregate tick-level data up to a lower frequency. Coupled with something like na.locf, you can make yourself some high frequency 'regular' data from 'irregular' data. Regular and irregular of course depend on what you are looking at (weekends missing in daily data can still be 'regular'). I'd be interested in hearing thoughts from those who actually tread in the high-freq domain... A wealth of information can be found here: http://www.olsen.ch/publications/working-papers/ Jeff
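Jeff's suggestion can be sketched roughly as follows. This is a minimal, hypothetical example: the tick series is simulated, not real market data, and the timestamps and grid spacing are arbitrary choices for illustration.

```r
library(xts)

# Simulated irregular tick series: 200 price events at random times
# within one hour (made-up data, for illustration only)
set.seed(1)
t0 <- as.POSIXct("2009-05-21 09:30:00", tz = "UTC")
ticks <- xts(100 + cumsum(rnorm(200, 0, 0.01)),
             order.by = t0 + sort(runif(200, 0, 3600)))

# Aggregate up to regular 1-minute OHLC bars
bars <- to.period(ticks, period = "minutes", k = 1)

# Or sample onto a regular 1-minute grid, carrying the last
# observation forward with na.locf
grid <- seq(start(ticks), end(ticks), by = "1 min")
regular <- na.locf(merge(ticks, xts(, order.by = grid)))[grid]
```

The `merge` against an empty xts on the regular grid inserts NA rows at the grid times, which `na.locf` then fills with the last observed price before subsetting back to the grid.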
_______________________________________________ R-SIG-Finance at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-finance -- Subscriber-posting only. -- If you want to post, subscribe first.
Jeffrey Ryan jeffrey.ryan at insightalgo.com ia: insight algorithmics www.insightalgo.com
Thanks Jeff. By high frequency I really mean tick data. For example, during peak times, price events can arrive at rates of hundreds to thousands per second, irregularly spaced. I've heard that forcing irregularly spaced data onto a regularly spaced grid (e.g. through interpolation) loses information. Why is that? Thanks!
My data are price-change arrivals, irregularly spaced. But when there is no price change, the price stays constant, so in fact, for any time instant you give me, I can give you the price at that very instant. Irregularly spaced data can therefore easily be sampled into regularly spaced data. What do you think of this approach?
Hello Michael,
On Thu, May 21, 2009 at 5:21 PM, Michael <comtech.usa at gmail.com> wrote:
By high frequency I mean really the tick data. For example, during peak time, the arrival of price events could be at about hundreds to thousands within one second, irregularly spaced.
If I understand correctly, you're dealing with an issue that I'm currently investigating: nonsynchronous data. You may be interested in library(realized), which implements at least the Hayashi-Yoshida covariance estimator (2005). Be sure to check the package's homepage for an extended user manual and a (possibly obsolete) table of implemented methods. There is also a paper dealing with synchronizing data using a "Refresh Time" methodology ("Multivariate realised kernels: consistent positive semi-definite estimators of the covariation of equity prices with noise and non-synchronous trading", Barndorff-Nielsen, Hansen, Lunde and Shephard, 2008).

From what I understood, the HY estimator is appropriate for very high-frequency data; unfortunately I am dealing with very low-frequency nonsynchronous data, and I'm still looking for a data synchronization method/consistent covariance estimator. If anyone is familiar with available methodology/R implementations, please share your thoughts. Best, Liviu
I think in general you will need some sort of pre-processing before using R. You can use periodic sampling of prices, but you may be throwing away a lot of information. This is a method that used to be recommended more than 5 years ago to mitigate the effect of market noise, at least in the context of volatility estimation.

Here is my experience with tick data: I used FX data to calculate estimated daily volatility using TSRV (Zhang et al. 2005, http://galton.uchicago.edu/~mykland/paperlinks/p1394.pdf). Using the time series of estimated daily volatilities, I forecasted volatilities from 1 day up to 1 year ahead. The tick data was in a Quantitative Analytics database. I used their C++ API to query daily data, computed the TSRV estimator in C++, and saved the result in a text file. Then I used R to read the estimated volatilities and used FARIMA to forecast volatility. An interesting thing about this type of series is that the fractional coefficient is approximately 0.4 in many instances; Bollerslev has a paper commenting on this fact.

In another project, I had treasury futures market depth data. The data came in plain text format, with one file per day, and each day had more than 1 million entries. I don't think I could handle this with R. To get started I decided to use only actual trades, which I filtered out with Python; that came down to ~60K entries per day, which I could handle in R. I used to.period from the xts package to aggregate the data.

To handle market depth data, we need some efficient way to access (query) this huge database. I looked a little bit into kdb, but you have to pay ~$25K to buy the software for one processor. I haven't been able to look more into this for now. Haky
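The FARIMA step described above might be sketched with the fracdiff package roughly as below. Everything here is an assumption for illustration: the volatility series is simulated as a stand-in for the TSRV estimates (in practice they would be read from the text file produced in C++), the FARIMA orders are arbitrary, and the availability of a predict method may depend on the fracdiff version installed.

```r
library(fracdiff)

# Simulated stand-in for a daily TSRV volatility series
# (hypothetical data; replace with the real estimates)
set.seed(7)
vol <- exp(0.1 * as.numeric(arima.sim(list(ar = 0.9), 500)) - 3)

# Fit a FARIMA(1, d, 1) to log-volatility; the fractional
# differencing parameter d is often found near 0.4, as noted
fit <- fracdiff(log(vol), nar = 1, nma = 1)
fit$d

# Forecast roughly one trading month ahead
fcst <- predict(fit, n.ahead = 21)
```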
If there is a way to call R functions from within C++, that should solve the large-data-set problem, right? Alternatively, you could split the data into smaller chunks, for example using SAS?
Could anybody comment on my approach of obtaining regularly spaced data from irregularly spaced price changes, and then using R to process them?
Not to distract from the underlying processing question, but to answer the 'data' one: the data size in R shouldn't be too much of an issue. xts objects on the order of millions of observations are still fast and memory-friendly with respect to the copying operations internal to many xts calls (merge, subset, etc).
x <- .xts(1:1e6, 1:1e6)
system.time(merge(x, x))
   user  system elapsed
  0.037   0.015   0.053

A 7-million-observation single-column xts is ~54 Mb, so you can certainly handle quite a bit of data if you have anything more than trivial amounts of RAM. quantmod now has (devel) an attachSymbols function that makes lazy-loading data very easy, so all your data can be stored as xts objects and read in on demand. xts is also getting the ability to query subsets of data on disk, by time; this will have no practical limit. For current data solutions, xts, fts (C++), data.table, and some other packages should mitigate your problems, if not solve the 'data' side altogether. HTH Jeff
Regarding the approach that turns irregular data into regular data: I guess it's a complex question, and how you approach it will depend on the specific problem. With your method, you would assume that the price is equal to the last traded price, or something like that. If there is no trade for some time, would it make sense to say that the price is the last traded price? If you wanted to actually buy/sell at that price, it's not obvious that you would be able to do so.

Also, if you only look at the time series of instantaneous prices, you lose a lot of information about what happened between the time points. It makes more sense to aggregate and keep, for example, open, high, low and close, or some statistics on the distribution of prices between the endpoints.

If what you need to calculate is correlations, then I would look at the papers that Liviu suggested; it seems that synchronicity is critical. I heard there is an extension of TSRV to correlations. If you only need to look at univariate time series, you may be able to get away with your method more easily. It may not be statistically efficient, but it may give you a good enough answer in some cases. HTH Haky
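The OHLC-versus-point-sampling point can be illustrated with to.period. This is a sketch on simulated data (made-up prices and timestamps); column positions follow xts's standard Open/High/Low/Close output.

```r
library(xts)

# Simulated irregular trade prices over half an hour
# (hypothetical data, for illustration only)
set.seed(42)
t0 <- as.POSIXct("2009-05-21 09:30:00", tz = "UTC")
ticks <- xts(100 + cumsum(rnorm(500, 0, 0.01)),
             order.by = t0 + sort(runif(500, 0, 1800)))

# 5-minute OHLC bars retain the high/low path within each bar...
ohlc <- to.period(ticks, period = "minutes", k = 5)

# ...whereas keeping only the closing price discards it
close_only <- ohlc[, 4]
```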
Jeff, This is very impressive. Even on my Macbook Air it takes less than 0.2 seconds total.
x <- .xts(1:1e6, 1:1e6)
system.time(merge(x, x))
   user  system elapsed
  0.093   0.021   0.198
quantmod now has (devel) an attachSymbols function that makes lazy-loading data very easy, so all your data can be stored as xts objects and read in on-demand.
When you say stored, does it mean on disk or memory?
xts is also getting the ability to query subsets of data on disk, by time. This will have no practical limit.
This would be great! Will we be able to append data to xts stored on disk? Thanks Haky
On Thu, May 21, 2009 at 11:23 AM, Jeff Ryan <jeff.a.ryan at gmail.com> wrote:
Not to distract from the underlying processing question, but to answer the 'data' one: The data in R should be too much of an issue, at least from a size perspective. xts objects on the order of millions of observations are still fast and memory friendly with respect to copying operations internal to many xts calls (merge, subset, etc).
x <- .xts(1:1e6, 1:1e6) system.time(merge(x,x))
? user ?system elapsed ?0.037 ? 0.015 ? 0.053 7 million obs of a single column xts is ~54 Mb. ?Certainly you can handle quite a bit of data if you have anything more than trivial amounts of RAM. quantmod now has (devel) an attachSymbols function that makes lazy-loading data very easy, so all your data can be stored as xts objects and read in on-demand. xts is also getting the ability to query subsets of data on disk, by time. ?This will have no practical limit. For current data solutions xts, fts (C++), data.table, and some other solutions should mitigate your problems, if not solve the 'data' side all together. HTH Jeff On Thu, May 21, 2009 at 11:13 AM, Hae Kyung Im <hakyim at gmail.com> wrote:
I think in general you would need some sort of pre-processing before using R. You can use periodic sampling of prices, but you may be throwing away a lot of information. This is a method that used to be recommended more than 5 years ago in order to mitigate the effect of market noise. At least in the context of volatility estimation. Here is my experience with tick data: I used FX data to calculate estimated daily volatility using TSRV (Zhang et al 2005 http://galton.uchicago.edu/~mykland/paperlinks/p1394.pdf). Using the time series of estimated daily volatilities, I forecasted volatilities for 1 day up to 1 year ahead. The tick data was in Quantitative Analytics database. I used their C++ API to query daily data, computed the TSRV estimator in C++ and saved the result in text file. Then I used R to read the estimated volatilities and used FARIMA to forecast volatility. An interesting thing about this type of series is that the fractional coefficient is approximately 0.4 in many instances. Bollerslev has a paper commenting on this fact. In another project, I had treasury futures market depth data. The data came in plain text format, with one file per day. Each day had more than 1 million entries. I don't think I could handle this with R. To get started I decided to use only actual trades. I used Python to filter out the trades. So this came down to ~60K entries per day. This I could handle with R. I used to.period from xts package to aggregate the data. In order to handle market depth data, we need some efficient way to access (query) this huge database. I looked a little bit into kdb but you have to pay ~25K to buy the software for one processor. I haven't been able to look more into this for now. Haky On Thu, May 21, 2009 at 10:15 AM, Jeff Ryan <jeff.a.ryan at gmail.com> wrote:
_______________________________________________ R-SIG-Finance at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-finance -- Subscriber-posting only. -- If you want to post, subscribe first.
-- Jeffrey Ryan jeffrey.ryan at insightalgo.com ia: insight algorithmics www.insightalgo.com
Haky, My times are from a new R session on a MacBook 2.16, so yes it is fast.
On Thu, May 21, 2009 at 12:02 PM, Hae Kyung Im <hakyim at gmail.com> wrote:
Jeff, This is very impressive. Even on my Macbook Air it takes less than 0.2 seconds total.
x <- .xts(1:1e6, 1:1e6)
system.time(merge(x, x))
   user  system elapsed
  0.093   0.021   0.198
quantmod now has (devel) an attachSymbols function that makes lazy-loading data very easy, so all your data can be stored as xts objects and read in on-demand.
When you say stored, do you mean on disk or in memory?
attachSymbols can use disk or memory for caching, but the files are read with getSymbols, so they can realistically be stored anywhere. The docs provide at least a small introduction. The tutorial I gave at R/Finance 2009 gives a small example as well. http://www.RinFinance.com/presentations
xts is also getting the ability to query subsets of data on disk, by time. This will have no practical limit.
This would be great! Will we be able to append data to xts stored on disk?
The core issue is whether to optimize for reads or writes. I lean toward read optimization, so something akin to a column-based structure. That will make writes more costly, but that is acceptable to me at the moment. I'll probably keep some sort of write-structure => read-structure conversion tool in the mix as well. I will of course keep the list updated on progress once it is ready for release.
Thanks Haky
Thanks, Jeff
On Thu, May 21, 2009 at 11:23 AM, Jeff Ryan <jeff.a.ryan at gmail.com> wrote:
Not to distract from the underlying processing question, but to answer the 'data' one: the data shouldn't be too much of an issue in R, at least from a size perspective. xts objects on the order of millions of observations are still fast and memory friendly with respect to the copying operations internal to many xts calls (merge, subset, etc.).
x <- .xts(1:1e6, 1:1e6)
system.time(merge(x, x))
   user  system elapsed
  0.037   0.015   0.053

A 7-million-observation, single-column xts is ~54 Mb, so you can certainly handle quite a bit of data if you have anything more than a trivial amount of RAM.

quantmod now has (devel) an attachSymbols function that makes lazy-loading data very easy, so all your data can be stored as xts objects and read in on-demand. xts is also getting the ability to query subsets of data on disk, by time. This will have no practical limit.

For current data solutions, xts, fts (C++), data.table, and some other packages should mitigate your problems, if not solve the 'data' side altogether.

HTH
Jeff
On 21 May 2009 at 11:13, Hae Kyung Im wrote:
| access (query) this huge database. I looked a little bit into kdb but
| you have to pay ~25K to buy the software for one processor. I haven't

True, but you can get a "free" (as in beer) 32-bit version that times out after two hours. That's not a bad compromise. I looked at it for a bit, and it has an R interface. (My blog has a patch to fix their then-broken interface to R's Datetime; I think they may have integrated that by now.) Then again, you can also pre-process into RData files, use hdf5, or use a gazillion other methods. But the free trial version may just help for the odd research project like the one Haky described.

Dirk
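The pre-processing Dirk mentions amounts to parsing the raw text once and saving the parsed object, so later sessions load a binary file instead of re-parsing. A minimal sketch with a made-up data frame and a temporary file:

```r
# One-time conversion: parse once, save as .RData, reload cheaply later.
ticks <- data.frame(time = c(1.2, 1.9, 3.4), price = c(100, 100.5, 101))
f <- tempfile(fileext = ".RData")
save(ticks, file = f)        # gzip-compressed binary by default
rm(ticks)

load(f)                      # restores 'ticks' into the workspace
nrow(ticks)
```

In a real workflow the data frame would come from read.table/read.csv over the raw tick files; the object here is a stand-in.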
Three out of two people have difficulties with fractions.
High-frequency is not my specialty either, but a quote caught my attention:
On Thu, May 21, 2009 at 11:38 AM, Michael <comtech.usa at gmail.com> wrote:
My data are price change arrivals, irregularly spaced. But when there is no price change, the price stays constant. Therefore, give me any time and I can tell you the price at that very instant. So irregularly spaced data can easily be sampled into regularly spaced data.
From a trader's perspective, you do not have "the price" at any time outside of the instant a trade took place - you have the NBBO (and market depth). The last trade's price may or may not be transactable again on either the long or the short side. Alternatively, you can say that you have an instantaneous "mid-market price" and a bid/ask spread to work with. Correct me if I'm wrong - I'd like to know how people in HF really look at their data. -- ET.
Is there any literature on the relative performance gain of preprocessing data into RData and then reading it into R? Does it break down anywhere? I have 4 GB of data that I'm reading in, and I/O is a large bottleneck.

Brian

-----Original Message-----
From: r-sig-finance-bounces at stat.math.ethz.ch [mailto:r-sig-finance-bounces at stat.math.ethz.ch] On Behalf Of Dirk Eddelbuettel
Sent: Thursday, May 21, 2009 1:42 PM
To: Hae Kyung Im
Cc: r-sig-finance at stat.math.ethz.ch
Subject: [R-SIG-Finance] Kdb (Was: high frequency data analysis in R)
-------------------------------------------------------------------------- This message w/attachments (message) may be privileged, confidential or proprietary, and if you are not an intended recipient, please notify the sender, do not use or share it and delete it. Unless specifically indicated, this message is not an offer to sell or a solicitation of any investment products or other financial product or service, an official confirmation of any transaction, or an official statement of Merrill Lynch. Subject to applicable law, Merrill Lynch may monitor, review and retain e-communications (EC) traveling through its networks/systems. The laws of the country of each sender/recipient may impact the handling of EC, and EC may be archived, supervised and produced in countries other than the country in which you are located. This message cannot be guaranteed to be secure or error-free. References to "Merrill Lynch" are references to any company in the Merrill Lynch & Co., Inc. group of companies, which are wholly-owned by Bank of America Corporation. Securities and Insurance Products: * Are Not FDIC Insured * Are Not Bank Guaranteed * May Lose Value * Are Not a Bank Deposit * Are Not a Condition to Any Banking Service or Activity * Are Not Insured by Any Federal Government Agency. Attachments that are part of this E-communication may have additional important disclosures and disclaimers, which you should read. This message is subject to terms available at the following link: http://www.ml.com/e-communications_terms/. By messaging with Merrill Lynch you consent to the foregoing. --------------------------------------------------------------------------
I feel like I should change the title again... :)

The RData files are compressed, first off. If you don't want the gzip overhead, get rid of it. The xts format 'on-disk' is nothing more than the structure from memory written to disk. This manages to be both faster and take up less space. It isn't a huge gain, but it allows for binary searching of the index to get to the data you want. I will put together a performance comparison at some point and pass it along.

Jeff
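The advantage Jeff describes for a flat on-disk layout - jumping straight to the rows you want instead of loading the whole file - can be illustrated with base R connections. This is a toy sketch, not the actual xts on-disk format:

```r
# Write a sorted numeric index as raw doubles, then seek to an offset
# and read only a small chunk, rather than deserializing everything.
idx <- as.numeric(1:1000)                 # pretend these are epoch timestamps
f <- tempfile()
con <- file(f, "wb"); writeBin(idx, con); close(con)

con <- file(f, "rb")
seek(con, where = 8 * 499)                # byte offset of the 500th double
chunk <- readBin(con, "double", n = 10)   # read just 10 observations
close(con)
chunk[1]                                  # 500
```

Because the index is sorted, a binary search over such a file can locate a time range in O(log n) seeks before reading the payload.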
... and possibly even a list change. Do you plan on making this compatible with ff or bigmemory? Seems like this theme is making its rounds.

Brian
Some resources: if you want to deal with irregular data, Eric Zivot's book on financial time series mentions some operators available in S-Plus, based on this Zumback/Muller paper: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=208278. These could easily be adapted to R. Ruey Tsay's book also has a chapter that touches on it. In general terms, Dacorogna et al. is a good overview. And as already mentioned, definitely look at the realized package: http://students.washington.edu/spayseur/realized/.
On Thu, May 21, 2009 at 12:51 PM, Hae Kyung Im <hakyim at gmail.com> wrote:
Regarding the approach that turns irregular data into regular data: I guess it's a complex question, and how you approach it will depend on the specific problem.

With your method, you would assume that the price is equal to the last traded price, or something like that. If there is no trade for some time, would it make sense to say that the price is the last traded price? If you wanted to actually buy or sell at that price, it's not obvious that you'd be able to do so. Also, if you only look at the time series of instantaneous prices, you lose a lot of information about what happened between the time points. It makes more sense to aggregate and keep, for example, open, high, low and close, or some statistics on the distribution of prices between the endpoints.

If what you need to calculate is correlations, I would look at the papers that Liviu suggested; it seems that synchronicity is critical. I heard there is an extension of TSRV to correlations. If you only need to look at univariate time series, you may be able to get away with your method more easily. It may not be statistically efficient but may give a good enough answer in some cases.

HTH
Haky
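The last-traded-price scheme under discussion (previous-tick sampling, i.e. last observation carried forward onto a regular grid) can be sketched in base R with made-up ticks:

```r
# Previous-tick sampling: at each grid point, take the last tick at or
# before it; NA before the first tick.
tick_time  <- c(0.4, 1.3, 1.7, 4.2, 6.9)       # irregular arrival times (seconds)
tick_price <- c(100, 101, 100.5, 102, 101.5)

grid <- 1:7                                    # regular 1-second grid
i <- findInterval(grid, tick_time)             # index of last tick <= grid point
sampled <- tick_price[replace(i, i == 0, NA)]
sampled
```

Note how the burst of ticks between seconds 1 and 2 collapses to a single value: exactly the intra-interval information loss Haky points out.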
Thanks Jeff. By high frequency I mean really the tick data. For example, during peak times price events can arrive at a rate of hundreds to thousands per second, irregularly spaced. I've heard that forcing irregularly spaced data into regularly spaced data (e.g. through interpolation) loses information. How is that so? Thanks!
I have given some thought to both ff and bigmemory. I am not a huge fan of the "ff" license: http://cran.r-project.org/web/packages/ff/LICENSE

bigmemory is interesting in that you can bypass the R memory issues on Windows, but I haven't had incredible luck with it. Me and C++ don't like each other, so maybe it is something related to that :). I can get around the Windows issues by using something non-Windows... Supposedly the changes in the most recent bigmemory are quite good, but trying the shared-memory route (one of the biggest reasons I would like to use it) has caused me catastrophic failure.

At the end of the day there is no good way to make xts rely on bigmemory. As so much of the xts code is in C, you can't readily operate on the external pointers from there. You need to read in the data (via `[`), and at that point it is resident in the R process, so you only get the penalty of the memory allocation and none of the gain.

Of course this is my 2c. Maybe we need another time-series library :)

Jeff
I failed to point out that data.table can make use of both (?) of those packages. It isn't a time-series library per se, but it makes for a very cool in-memory database, similar in spirit to some of the not-so-free ones out there...

Jeff
On Thu, May 21, 2009 at 1:22 PM, Jeff Ryan <jeff.a.ryan at gmail.com> wrote:
I have given some thought to both ff and bigmemory. I am not a huge fan of the "ff" license. http://cran.r-project.org/web/packages/ff/LICENSE bigmemory is interesting in that you can bypass the R memory issues on Windows, but I haven't had incredible luck with it. Me and C++ don't like each other, so maybe it is something related to that :). I can get around the Windows issues by using something non-Windows... Supposedly the changes to the most recent bigmemory are quite good, but trying the shared-memory route (one of the biggest reasons I would like to use it) has caused me catastrophic failure. At the end of the day there is no good way to make xts rely on bigmemory. Since so much xts code is in C, you can't readily operate on the external pointers from there. You need to read in the data (via `[`), and at that point it is resident in the R process, so you are only getting the penalty of the memory allocation and none of the gain. Of course this is my 2c. Maybe we need another time-series library :) Jeff On Thu, May 21, 2009 at 1:10 PM, Rowe, Brian Lee Yung (Portfolio Analytics) <B_Rowe at ml.com> wrote:
... and possibly even a list change. Do you plan on making this compatible with ff or bigmemory? Seems like this theme is making its rounds. Brian -----Original Message----- From: Jeff Ryan [mailto:jeff.a.ryan at gmail.com] Sent: Thursday, May 21, 2009 1:58 PM To: Rowe, Brian Lee Yung (Portfolio Analytics) Cc: Dirk Eddelbuettel; Hae Kyung Im; r-sig-finance at stat.math.ethz.ch Subject: Re: [R-SIG-Finance] Preprocessing RData file (Was: Kdb (Was: high frequency data analysis in R)) I feel like I should change the title again... :) The RData files are compressed, first off. If you don't want the gzip overhead, get rid of it. The xts 'on-disk' format is nothing more than the in-memory structure written to disk. This manages to be both faster and to take up less space. It isn't a huge gain, but it allows for binary searching of the index to get to the data you want. I will put together a performance comparison at some point and pass it along. Jeff On Thu, May 21, 2009 at 12:52 PM, Rowe, Brian Lee Yung (Portfolio Analytics) <B_Rowe at ml.com> wrote:
Is there any literature on the relative performance gain of preprocessing data into RData and then reading it into R? Does it break down anywhere? I have 4 GB of data that I'm reading in, and I/O is a large bottleneck. Brian -----Original Message----- From: r-sig-finance-bounces at stat.math.ethz.ch [mailto:r-sig-finance-bounces at stat.math.ethz.ch] On Behalf Of Dirk Eddelbuettel Sent: Thursday, May 21, 2009 1:42 PM To: Hae Kyung Im Cc: r-sig-finance at stat.math.ethz.ch Subject: [R-SIG-Finance] Kdb (Was: high frequency data analysis in R) On 21 May 2009 at 11:13, Hae Kyung Im wrote: | access (query) this huge database. I looked a little bit into kdb but | you have to pay ~25K to buy the software for one processor. I haven't True, but you can get a "free" (as in beer) 32-bit version that times out after two hours. That's not a bad compromise. I looked at it for a bit, and it has an R interface. (My blog has a patch to fix their then-broken interface to R's Datetime; I think they may have integrated that by now.) Then again, you can also pre-process into RData files, or use hdf5, or use a gazillion other methods. But the free trial version may just help for the odd research project like the one Haky described. Dirk -- Three out of two people have difficulties with fractions.
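On Brian's question, the gain is easy to measure yourself: parse the text once, write the resulting object out in R's binary format, and compare load times against re-parsing. A hedged base-R sketch (file names and data are illustrative; `compress = FALSE` avoids the gzip overhead Jeff mentions, at the cost of a larger file):

```r
## Compare re-parsing a CSV with loading a pre-processed binary copy.
n   <- 1e5
dat <- data.frame(time = seq_len(n), price = 100 + cumsum(rnorm(n)))

csv <- tempfile(fileext = ".csv")
rds <- tempfile(fileext = ".rds")
write.csv(dat, csv, row.names = FALSE)
saveRDS(dat, rds, compress = FALSE)   # compress = TRUE trades CPU time for disk space

t_csv <- system.time(d1 <- read.csv(csv))["elapsed"]
t_rds <- system.time(d2 <- readRDS(rds))["elapsed"]
stopifnot(all.equal(d1$price, d2$price))  # same data either way
c(csv = t_csv, rds = t_rds)               # binary load is typically much faster
```

The same applies to `save`/`load` for .RData files holding several objects; `saveRDS`/`readRDS` just makes the single-object round trip explicit.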
In fact, I have the whole jump process of the best bid and the best ask at a continuous level (in the sense of time-stamped arrival data), and also the jump process of the last trade price, again as time-stamped arrival data. Any more thoughts?
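Michael's observation that the price is piecewise constant between arrivals is exactly last-observation-carried-forward sampling. With xts/zoo the idiomatic route is `na.locf` on a merged regular index; the same idea can be sketched in base R with `approx(..., method = "constant")` (toy numbers below, not real ticks):

```r
## Irregular tick times (seconds since open) and prices;
## the price holds its last value between ticks.
tick_t <- c(0.0, 0.7, 1.3, 3.9)
tick_p <- c(100.0, 100.5, 100.2, 101.0)

grid <- 0:4                                   # regular 1-second grid
reg  <- approx(tick_t, tick_p, xout = grid,
               method = "constant", f = 0,    # f = 0: carry the LAST tick forward
               rule = 2)                      # extend the final price past 3.9s
reg$y                                         # 100.0 100.5 100.2 100.2 101.0
```

Note this answers "what was the last traded price at time t", which, as Haky points out below, is not the same as "what price could I have traded at".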
On Thu, May 21, 2009 at 9:51 AM, Hae Kyung Im <hakyim at gmail.com> wrote:
Regarding the approach that turns irregular data into regular data: I guess it's a complex question, and how you approach it will depend on the specific problem. With your method, you would assume that the price is equal to the last traded price, or something like that. If there is no trade for some time, would it make sense to say that the price is the last traded price? If you wanted to actually buy/sell at that price, it's not obvious that you would be able to do so. Also, if you only look at the time series of instantaneous prices, you lose a lot of information about what happened between the time points. It makes more sense to aggregate and keep, for example, open, high, low and close, or some statistics on the distribution of the prices between the endpoints. If what you need to calculate is correlations, then I would look at the papers that Liviu suggested. It seems that synchronicity is critical. I heard there is an extension of TSRV to correlations. If you only need to look at univariate time series, you may be able to get away with your method more easily. It may not be statistically efficient, but it may give you a good enough answer in some cases. HTH Haky On Thu, May 21, 2009 at 10:38 AM, Michael <comtech.usa at gmail.com> wrote:
My data are price-change arrivals, irregularly spaced. But when there is no price change, the price stays constant. Therefore, in fact, at any time instant: you give me a time, and I can give you the price at that very instant. So irregularly spaced data can easily be sampled into regularly spaced data. What do you think of this approach? On Thu, May 21, 2009 at 8:21 AM, Michael <comtech.usa at gmail.com> wrote:
Thanks Jeff. By high frequency I mean really the tick data. For example, during peak times the arrival of price events could run from hundreds to thousands within one second, irregularly spaced. I've heard that forcing irregularly spaced data into regularly spaced data (e.g. through interpolation) will lose information. How is that so? Thanks! On Thu, May 21, 2009 at 8:15 AM, Jeff Ryan <jeff.a.ryan at gmail.com> wrote:
fts is, underneath, a C++ policy-template-based class. The underlying storage can be pretty much anything, as long as you implement the public API. Here is the backend for R: http://github.com/armstrtw/r.tslib.backend/tree/master The backend could just as easily be a Python object allocator, a data.frame, or a flat file. -Whit
I'm new to R and interested in working with large amounts of data (time series, but regularly spaced). Can you point me to a good reference for using data.table with bigmemory or ff? (I'm a bit puzzled about what exactly these packages provide. As I understand it, on 32-bit platforms files are subject to the same 2GB limit as in-process memory, so I assume that dealing with a larger dataset still requires breaking it up into multiple files...) Thanks for your help. I failed to point out that data.table can make use of both (?) those packages. [ff, bigmemory] It isn't a time-series library per se, but it makes for a very cool in-memory database. Similar in spirit to some of the not-so-free ones out there... Jeff
From a recent post by the author:
http://finzi.psych.upenn.edu/R/Rhelp08/2009-March/193490.html Further information on 'ff' and 'bigmemory' is covered in those respective packages. As far as combining the two/three, I would wait to hear back from Matt on exactly how to do that. I thought there was an example somewhere, if I recall... The main advantage to using large datasets in RAM is simply efficiency. 'ff' makes that process manageable without a lot of RAM; bigmemory can bypass the single-process limits of R (and do some cool memory sharing). The advantage to both is really confined to 32-bit processing, if I am thinking straight. This is probably more of a question for R-help at this point, or even R-SIG-DB, as the 'finance' part is only tangential. If you can break the data up with something like a db scheme, then xts will be faster than all (?) the other solutions for in-memory manipulation, as it is time-series oriented. And if you've got 64 bits and lots of RAM, it should do most of what you need. HTH Jeff
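Tying back to the aggregation advice earlier in the thread: with xts the tick-to-bar step is a one-liner, `to.period(x, period = "minutes")`. The bucketing itself is simple enough to sketch in base R, which makes it clear what is kept (toy trade data below, invented for illustration):

```r
## Toy trades: seconds since the open, and traded prices
t_sec <- c(5, 20, 50, 70, 100, 130, 170)
price <- c(10, 11, 9, 12, 12.5, 8, 9.5)

bar  <- floor(t_sec / 60)                      # which 1-minute bucket each trade is in
ohlc <- t(sapply(split(price, bar), function(p)
  c(open = p[1], high = max(p), low = min(p), close = p[length(p)])))
ohlc
## one row per minute: e.g. minute 0 has open 10, high 11, low 9, close 9
```

Keeping open/high/low/close per bar retains some of the within-interval information that pure instantaneous sampling throws away, which is the point Haky made above.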
Jeffrey Ryan jeffrey.ryan at insightalgo.com ia: insight algorithmics www.insightalgo.com