Message-ID: <9q0ipdt289h.fsf@gmx.us>
Date: 2012-07-11T23:22:18Z
From: Mikhail Titov
Subject: do I need plyr, apply or something else?
In-Reply-To: <CAAmySGMR93t6W7FpXjC+Fy9pgx2OrReguBDWGezqsENwy9zLxg@mail.gmail.com> (R. Michael Weylandt's message of "Wed, 11 Jul 2012 17:02:10 -0500")

"R. Michael Weylandt" <michael.weylandt at gmail.com> writes:

> On Wed, Jul 11, 2012 at 10:05 AM, Russell Bowdrey
> <Russell.Bowdrey at justretirement.com> wrote:
>>
>> Dear all,
>>
>> This is what I'd like to do (I have an implementation using for
>> loops, which I designed before I realised just how slow R is at
>> executing them - this process currently takes days to run).
>>
>> I have a large dataframe containing corporate bond data, columns are:
>> BondID
>> Date (goes back 5 years)
>> Var1
>> Var2
>> Term2Maturity
>>
>> What I want to do is this:
>>
>> 1) For each bond, at each given date, look back over 1 year and
>> append some statistics to each row (sd(Var1), cor(Var1,Var2) over
>> that year, etc.)
>>
>
> Look at the TTR package and the various run** functions. Much faster.
>
>> a.  It seems I might be able to use ddply for this, but I can't work
>> out how to code the stats function to only look back over one year,
>> rather than the full data range
>>
>> b.      For example: dfBondsWithCorr<-ddply(dfBonds, .(BondID), transform,corr=cor(Var1,Var2),.progress="text")
>> returns a dataframe where for each bond it has same corr for each date
>>
>> 2) On each date, subset dfBondsWithCorr by certain qualification
>> criteria, then to the qualifiers fit a regression through a Var1 and
>> Term2Maturity, output the regression as a df of curves (say for each
>> date, a curve represented by points every 0.5 years)
>>
>> a.  I can do this pretty efficiently for a single date (and I
>> suppose I could wrap that in a function) , but can't quite see how
>> to do the filtering and spitting out of curves over multiple dates
>> without using for loops
>>
>
> This one's harder. For simple linear regressions, you can solve the
> regression analytically (e.g., slope = runCov / runVar and mean
> similarly) but doing it for more complicated regressions will pretty
> much require a for loop of one sort or another. Can you say what sort
> of model you are looking to use?
>
>> Would appreciate any thoughts, many thanks in advance

I feel like PostgreSQL will do the work better. It has support for basic
statistics such as stddev_samp and corr [1], and you can use window
functions [2] to limit the scope to the last year only. Then you pull
the results into R with RODBC or something similar.

I suspect you have your data in some sort of DB in the first
place. Perhaps it has similar features.
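
Whichever engine ends up doing the work, it may help to have a small
reference to check the windowed results against. Below is a minimal
sketch in base R of the "look back over one year" calculation for each
bond. The column names follow the original post; the toy data, the
helper name trailing_stats, the 365-day window and the minimum of 3
observations are all assumptions made for illustration. It is meant to
pin down the window definition, not to be fast.

```r
# Toy data: column names follow the original post; the values are made up.
set.seed(1)
dates <- seq(as.Date("2010-01-01"), by = "month", length.out = 36)
dfBonds <- data.frame(
  BondID = rep(c("A", "B"), each = length(dates)),
  Date   = rep(dates, times = 2),
  Var1   = rnorm(2 * length(dates)),
  Var2   = rnorm(2 * length(dates))
)

# Hypothetical helper: for one bond, compute statistics over the `days`
# days up to and including each row's date.
trailing_stats <- function(d, days = 365, min_obs = 3) {
  d <- d[order(d$Date), ]
  d$sd1  <- NA_real_
  d$corr <- NA_real_
  for (i in seq_len(nrow(d))) {
    # Rows of this bond that fall inside the trailing window.
    w <- d$Date > d$Date[i] - days & d$Date <= d$Date[i]
    if (sum(w) >= min_obs) {
      d$sd1[i]  <- sd(d$Var1[w])
      d$corr[i] <- cor(d$Var1[w], d$Var2[w])
    }
  }
  d
}

# Apply the helper bond by bond and stack the pieces back together.
dfBondsWithCorr <- do.call(
  rbind,
  lapply(split(dfBonds, dfBonds$BondID), trailing_stats)
)
```

Rows with fewer than min_obs observations in their window keep NA, so
the first dates of each bond carry no statistics.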

[1] http://www.postgresql.org/docs/9.1/static/functions-aggregate.html#FUNCTIONS-AGGREGATE-STATISTICS-TABLE
[2] http://www.postgresql.org/docs/9.1/interactive/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS

-- 
Mikhail