do I need plyr, apply or something else?

Wed, Jul 11, 2012 4:22 PM

"R. Michael Weylandt" <michael.weylandt at gmail.com> writes:

On Wed, Jul 11, 2012 at 10:05 AM, Russell Bowdrey
<Russell.Bowdrey at justretirement.com> wrote:

Dear all,

This is what I'd like to do (I have an implementation using for
loops, which I designed before I realised just how slow R is at
executing them - this process currently takes days to run).

I have a large dataframe containing corporate bond data, columns are:
BondID
Date (goes back 5 years)
Var1
Var2
Term2Maturity

What I want to do is this:

1) For each bond, at each given date, look back over 1 year and
append some statistics to each row (sd(Var1), cor(Var1, Var2) over
that year, etc.)

Look at the TTR package and the various run** functions. Much faster.

a. It seems I might be able to use ddply for this, but I can't work
out how to code the stats function to only look back over one year,
rather than the full data range

b. For example:
dfBondsWithCorr <- ddply(dfBonds, .(BondID), transform, corr = cor(Var1, Var2), .progress = "text")
returns a data frame where each bond has the same corr on every date
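
A minimal sketch of the run* suggestion above, applied per bond so the
statistics differ by date. It assumes the TTR package is installed,
that rows are daily observations, and that one year is roughly 250
observations (the window is counted in rows, not calendar days). The
toy data, the helper add_rolling and the column names sd1 and corr are
made up for illustration.

```r
library(TTR)

# Toy data: two hypothetical bonds with 300 observations each.
set.seed(1)
n_obs <- 300
dfBonds <- data.frame(
  BondID = rep(c("A", "B"), each = n_obs),
  Date   = rep(seq(as.Date("2011-01-03"), by = "day", length.out = n_obs), 2),
  Var1   = rnorm(2 * n_obs),
  Var2   = rnorm(2 * n_obs)
)

win <- 250  # ASSUMPTION: number of observations in one year

# Rolling statistics for a single bond; the first win - 1 rows are NA.
# Note: the run* functions stop with an error if a bond has fewer than
# win observations, so short histories need separate handling.
add_rolling <- function(d) {
  d <- d[order(d$Date), ]
  d$sd1  <- runSD(d$Var1, n = win)
  d$corr <- runCor(d$Var1, d$Var2, n = win)
  d
}

# Split by bond so that no window ever spans two bonds, then recombine.
dfBondsWithCorr <- do.call(
  rbind,
  lapply(split(dfBonds, dfBonds$BondID), add_rolling)
)
```

If the dates are irregular, a fixed row count is only an approximation
of "one year" and the window would have to be built from the dates
instead.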

2) On each date, subset dfBondsWithCorr by certain qualification
criteria, then fit a regression of Var1 on Term2Maturity to the
qualifiers and output the result as a df of curves (say, for each
date, a curve represented by points every 0.5 years)

a. I can do this pretty efficiently for a single date (and I
suppose I could wrap that in a function), but can't quite see how
to do the filtering and spitting out of curves over multiple dates
without using for loops

This one's harder. For simple linear regressions, you can solve the
regression analytically (e.g., slope = runCov / runVar, with the
intercept obtained similarly from the means) but doing it for more
complicated regressions will pretty
much require a for loop of one sort or another. Can you say what sort
of model you are looking to use?
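
As a rough illustration of the analytic route, here is a base R sketch
that fits one straight line per date in closed form and evaluates it
every 0.5 years. It assumes a simple linear model of Var1 on
Term2Maturity (the thread does not say which model is wanted) and that
the qualification filter has already been applied. The data frame
dfQual, the helper fit_curve, the grid range and the output column
names are all hypothetical.

```r
# Toy data: hypothetical qualifying bonds on two dates.
set.seed(42)
dfQual <- data.frame(
  Date          = rep(as.Date(c("2012-07-10", "2012-07-11")), each = 20),
  Term2Maturity = runif(40, 0.5, 10)
)
dfQual$Var1 <- 1 + 0.3 * dfQual$Term2Maturity + rnorm(40, sd = 0.1)

# ASSUMPTION: curves are wanted from 0.5 to 10 years, every 0.5 years.
grid <- seq(0.5, 10, by = 0.5)

# Closed-form simple regression for the rows of a single date:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x).
fit_curve <- function(d) {
  slope     <- cov(d$Term2Maturity, d$Var1) / var(d$Term2Maturity)
  intercept <- mean(d$Var1) - slope * mean(d$Term2Maturity)
  data.frame(Date = d$Date[1], Term = grid, Fitted = intercept + slope * grid)
}

# One curve per date, stacked into a single data frame.
dfCurves <- do.call(rbind, lapply(split(dfQual, dfQual$Date), fit_curve))
```

The lapply call is still a loop over dates, but each step is only a few
vectorised operations rather than a full model fit.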

Would appreciate any thoughts, many thanks in advance

I feel like PostgreSQL will do the work better. It has support for basic
statistics [1] and you can use window functions [2] to limit the scope
to the last year only. Then you get your data with RODBC or something.

I suspect you have your data in some sort of DB in the first
place. Perhaps it has similar features.

[1] http://www.postgresql.org/docs/9.1/static/functions-aggregate.html#FUNCTIONS-AGGREGATE-STATISTICS-TABLE
[2] http://www.postgresql.org/docs/9.1/interactive/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS

Mikhail
