Keeping persistent data collections

Hello, I recently found this list and have been reading deep into the
archives. I am wondering how people here maintain their collections of data
for easy use in R. I am wondering a few things:

1) How do members of this list deal with keeping persistent data
collections with R? I was thinking of individual xts objects by asset and
frequency (such as AAPL daily, AAPL minute, AAPL 60m, etc). While I can
store and maintain these xts objects on disk and load them into R as
needed, I am wondering if there is a better solution.

I store only tick data, as I can easily get to any other frequency from
tick.  I've considered also storing daily data, but in the end I decided
it was too much trouble to (additionally) manage, and just store tick.

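
As a rough illustration of deriving lower frequencies from stored ticks, the sketch below aggregates a simulated tick series into 5-minute and daily OHLC bars with xts. The tick data, the seed, and the object names (prices, bars5, daily) are made up for the example and are not from the thread.

```r
library(xts)

# Hypothetical tick data: 5000 trade prices at irregular times within a
# single UTC trading day (simulated, for illustration only).
set.seed(42)
session_open <- as.POSIXct("2011-11-04 09:30:00", tz = "UTC")
tick_times <- session_open + sort(runif(5000, min = 0, max = 6.5 * 3600))
prices <- xts(100 + cumsum(rnorm(5000, sd = 0.01)), order.by = tick_times)
colnames(prices) <- "Price"

# Aggregate the univariate tick series into OHLC bars at two frequencies.
bars5 <- to.period(prices, period = "minutes", k = 5)  # 5-minute bars
daily <- to.daily(prices)                              # one bar per day

head(bars5)
daily
```

Because every bar is computed from the same tick series, only the tick files need to be maintained on disk.
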
2) Coming from that, I have been looking into the indexing package for my
needs. It seems very useful for managing a lot of large data sets in
memory, but I am not sure it is a good method for maintaining persistent
data: I have had trouble adding information to existing data that is
indexed on disk. Do posters here use indexing for this purpose? I did find
an old post or two touching on that with no specifics. I would like to be
able to combine the ability of indexing to have many large data sets
available in memory with persistent storage of data. Has anyone any
experience doing this?

You are correct that the 'indexing' package is very powerful.  It is
also not done yet.

As I said, I store tick data.  The way I do this is with single files
per day of data per symbol, pre-parsed into xts objects and stored to
disk in one directory per symbol (using 'save').    
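
A minimal sketch of that layout, assuming one .rda file per day in a directory named after the symbol. The helper names save_by_day and load_days and the file-name pattern are hypothetical; they only illustrate the idea and are not the functions the FinancialInstrument package provides.

```r
library(xts)

# Hypothetical helper: write one .rda file per calendar day into
# <base_dir>/<symbol>/, named <YYYY.MM.DD>.<symbol>.rda (assumed pattern).
save_by_day <- function(x, symbol, base_dir) {
  sym_dir <- file.path(base_dir, symbol)
  dir.create(sym_dir, recursive = TRUE, showWarnings = FALSE)
  days <- split(x, f = "days")            # list of xts objects, one per day
  files <- character(length(days))
  for (i in seq_along(days)) {
    day <- format(index(days[[i]])[1], "%Y.%m.%d")
    files[i] <- file.path(sym_dir, paste(day, symbol, "rda", sep = "."))
    assign(symbol, days[[i]])             # store under the symbol's own name
    save(list = symbol, file = files[i])
  }
  files
}

# Hypothetical helper: load every daily file for a symbol and bind them
# back into a single xts object.
load_days <- function(symbol, base_dir) {
  files <- sort(list.files(file.path(base_dir, symbol),
                           pattern = "\\.rda$", full.names = TRUE))
  pieces <- lapply(files, function(f) {
    e <- new.env()
    load(f, envir = e)
    get(symbol, envir = e)
  })
  do.call(rbind, pieces)
}

# Three days of simulated minute data, written out and read back.
idx <- as.POSIXct("2011-11-01 00:00:00", tz = "UTC") + 60 * (0:4319)
AAA <- xts(rnorm(length(idx)), order.by = idx)
colnames(AAA) <- "AAA"

base_dir <- tempfile("tickstore")
files <- save_by_day(AAA, "AAA", base_dir)
restored <- load_days("AAA", base_dir)
```

With one file per day, adding a new day means writing one new file, and a date-range request only has to touch the files inside that range.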

I then use FinancialInstrument to keep track of all the instrument
metadata, and getSymbols to load the data into R when I need it (and
over the time-frames that I require).  We currently download tick data
for about 2500 tradeable instruments per day, and maintain archives
going back several years.  We have the .instrument environment stored on
the same file server as the data, and every .Rprofile in the firm points
to this so that everyone has access to getInstrument and getSymbols.

I know someone who works in the hedge fund industry, mostly with monthly
data, with some daily data sprinkled in.  He uses the same approach I
have outlined of storing the metadata in FinancialInstrument, and
getSymbols to access the data.  He typically stores one consolidated CSV
file per instrument, because CSV files are easy to add on to with a
batch process.  
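
A minimal base-R sketch of such a batch append, assuming one CSV per instrument with a Date column and a Close column. The helper name append_rows, the column names, and the sample values are hypothetical.

```r
# Hypothetical helper: append new rows to an instrument's CSV file,
# writing the header only when the file is first created.
append_rows <- function(new_rows, csv_file) {
  first_write <- !file.exists(csv_file)
  write.table(new_rows, file = csv_file, sep = ",", row.names = FALSE,
              col.names = first_write, append = !first_write)
  invisible(csv_file)
}

csv_file <- tempfile("DDD", fileext = ".csv")

# First batch creates the file, a later batch extends it.
earlier <- data.frame(Date = c("2011-09-30", "2011-10-31"),
                      Close = c(50.1, 50.4))
latest  <- data.frame(Date = "2011-11-30", Close = 49.9)
append_rows(earlier, csv_file)
append_rows(latest, csv_file)

# Read the consolidated history back and restore the Date class.
history <- read.csv(csv_file, stringsAsFactors = FALSE)
history$Date <- as.Date(history$Date)
history
```

This sketch does not check for duplicate or out-of-order dates; a real batch job would probably want to.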

For lower frequency data (let's say daily or lower) a database is
certainly an option, and there are getSymbols wrappers that could be
adapted to whatever schema you decide to use. There are also tick
data database providers such as OneTick and kdb; if you have the
problem and the resources that call for this type of solution, you
probably already know that you are in this camp, and know that these
providers have R interfaces of varying quality.

The FinancialInstrument package has a 'parsers' directory included in
the 'inst' directory of the package with many examples of download and
parse routines for regular loading of data from a variety of free or
subscription providers.  This should give you a lot of material to begin
working with your own data providers.

3) How do people keep track of all the data sets within R? Are there any
useful packages for keeping track of multiple sets of financial data and
the information about them?

We wrote and use FinancialInstrument for this purpose.

As I said earlier, I see no value in storing different periodicities,
and store only tick.

One of the reasons that I chose to write a getSymbols wrapper for
retrieving our tick data stores is that resources like this list have
extensive experience with getSymbols, and it is therefore easy
for people at our firm to become familiar with using the data.

Also, I am reasonably confident that as the indexing package matures,
there will be a getSymbols method for it as well, and if appropriate I
can easily convert all my data in one batch pass and it will be
transparent to my users.

I made what I now realize to have been a mistake at a previous firm in
writing a data retrieval function that was not compatible with
getSymbols.  It was more complex to teach people to use, and
less compatible with huge amounts of other publicly available code.

quantmod and FinancialInstrument contain examples of various getSymbols
methods that may meet your needs, or that could serve as templates for
your custom in-house data source.

4) Any other pointers? I know many here are well versed and manage large
data sets with R. Any tips you have, or even simply pointing me in a
helpful direction to useful packages you use, would be great. This list is
a great help for me and I am still browsing old threads!

Regards,

    - Brian

--
Brian G. Peterson
http://braverock.com/brian/
Ph: 773-459-4973
IM: bgpbraverock

I do what Brian described, and I use a couple functions from
FinancialInstrument to do it.

library(FinancialInstrument)
?saveSymbols.days
?saveSymbols.common
?getSymbols.FI

(I just noticed that those 2 saveSymbols.* functions do not allow for a
data extension other than the old .rda.  I will probably update that today.)

I put together a little example, which I'll attach as well as paste below.

This is how I do it, but I certainly encourage suggestions for improvement.

HTH,
Garrett
library(FinancialInstrument)

# object with daily periodicity
data(sample_matrix)
DDD <- as.xts(sample_matrix)

# object with minute periodicity
AAA <- xts(rnorm(10000), Sys.time()-(60*1:10000))
AAA <- align.time(AAA)
colnames(AAA) <- "AAA"

# look at the objects we're going to store
head(AAA)
#                             AAA
# 2011-10-31 09:04:00  0.05152989
# 2011-10-31 09:05:00  0.12797379
# 2011-10-31 09:06:00  0.96025183
# 2011-10-31 09:07:00 -0.23265907
# 2011-10-31 09:08:00  1.77706849
# 2011-10-31 09:09:00 -1.29139344
head(DDD)
#                Open     High      Low    Close
# 2007-01-02 50.03978 50.11778 49.95041 50.11778
# 2007-01-03 50.23050 50.42188 50.23050 50.39767
# 2007-01-04 50.42096 50.42096 50.26414 50.33236
# 2007-01-05 50.37347 50.37347 50.22103 50.33459
# 2007-01-06 50.24433 50.24433 50.11121 50.18112
# 2007-01-07 50.13211 50.21561 49.99185 49.99185

# save to disk: one file per day for AAA, one common file for DDD
mydir <- getwd()
saveSymbols.days("AAA", base_dir=mydir)
saveSymbols.common("DDD", base_dir=mydir)

# now that they are on disk,
# remove them from workspace
rm("AAA", "DDD")

# get from disk
getSymbols("AAA", src='FI', dir=mydir, split_method='days',
           from='2011-10-31')
# [1] "AAA"
getSymbols("DDD", src='FI', dir=mydir, split_method='common')
# [1] "DDD"
head(AAA)
#                             AAA
# 2011-10-31 09:04:00  0.05152989
# 2011-10-31 09:05:00  0.12797379
# 2011-10-31 09:06:00  0.96025183
# 2011-10-31 09:07:00 -0.23265907
# 2011-10-31 09:08:00  1.77706849
# 2011-10-31 09:09:00 -1.29139344
head(DDD)
#                Open     High      Low    Close
# 2007-01-02 50.03978 50.11778 49.95041 50.11778
# 2007-01-03 50.23050 50.42188 50.23050 50.39767
# 2007-01-04 50.42096 50.42096 50.26414 50.33236
# 2007-01-05 50.37347 50.37347 50.22103 50.33459
# 2007-01-06 50.24433 50.24433 50.11121 50.18112
# 2007-01-07 50.13211 50.21561 49.99185 49.99185

#--------
# You can setSymbolLookup so that getSymbols will know where
# to look.  There are 2 ways to setSymbolLookup: explicitly,
# or by setting the "src" field of an instrument.

# explicitly
setSymbolLookup(DDD=list(src='FI', dir=mydir, split_method='common'))
getSymbols("DDD")
# [1] "DDD"

# by using the "src" field of an instrument
stock("AAA", currency("USD"), src=list(src='FI', dir=mydir,
                                       split_method='days'))
# [1] "AAA"
getSymbols("AAA", from='2011-10-31')
# [1] "AAA"

# cleanup
rm("AAA", "DDD")
unlink("AAA", recursive=TRUE)
unlink("DDD", recursive=TRUE)

On Sun, 2011-11-06 at 22:43 -0500, Dino Veritas wrote:
[snip]

_______________________________________________
R-SIG-Finance at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-finance
-- Subscriber-posting only. If you want to post, subscribe first.
-- Also note that this is not the r-help list where general R questions
should go.
