Thoughts for faster indexing

Steve Lianoglou · 2013-11-26T20:23:51Z


Hi,

On Tue, Nov 26, 2013 at 11:41 AM, Noah Silverman <noahsilverman at ucla.edu> wrote:

A few quick ones.

You had said you tried data.table and found it to be slow still -- my
guess is that you might not have used it correctly, so here is a rough
sketch of what to do.

Let's assume that your date is converted to some integer -- I will
leave that exercise to you :-) -- but it seems like you just want to
calculate the number of (whole) days since an event that you have a
record for, so this should be (in principle) easy to do (if you really
need the full power of "date math", data.table supports that as well).
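For what it's worth, here is a minimal base R sketch of that conversion. It assumes the dates arrive as "YYYY-MM-DD" strings; that format and the toy values are assumptions for illustration, not taken from the original data.

```r
## Hypothetical input: dates stored as "YYYY-MM-DD" character strings.
date_chr <- c("2013-11-01", "2013-11-05", "2013-11-26")

## as.Date() parses the strings; as.integer() then gives whole days
## since 1970-01-01, so differences are plain day counts.
date_int <- as.integer(as.Date(date_chr, format = "%Y-%m-%d"))

## Gaps between consecutive records, in whole days.
diff(date_int)
```

With the dates stored this way, "days since" is ordinary integer subtraction.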

Also, you never "reset" your `temp` variable, so it looks like you are
carrying over `temp` from one `id` group to the next (and, while I
have no knowledge of your problem, I would imagine this is not what
you want to do).

Anyway some rough ideas to get you started:

R> d <- as.data.table(d)
R> setkeyv(d, c('id', 'date'))

Now the records within each id are ordered by date, from first to last.
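To see what the key does to row order, here is a small sketch on made-up data (the table `toy` and its values are hypothetical, and it assumes the data.table package is installed):

```r
library(data.table)

## Toy data (made up): two ids, rows deliberately out of order.
toy <- data.table(id   = c("b", "a", "b", "a"),
                  date = c(10L, 7L, 3L, 2L))

## Setting the key physically re-sorts the table by id, then by date.
setkeyv(toy, c("id", "date"))

key(toy)   # the key columns, in sort order
toy        # rows grouped by id, dates ascending within each id
```

Because each id's rows end up contiguous and date-sorted, the record "before" a given row is its predecessor in time, except for the first row of each id.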

The specifics of your decay score escape me a bit, e.g., what is the
value of "days_since" for the first record of each id? I'll let you
figure that out, but in the non-edge cases, it looks like you can just
calculate "days since" by subtracting the date recorded in the
previous record from the current record's date. (Note that `.I` is a
special data.table variable that holds the row numbers, in the
original data.table, of the records in the current group):

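d[, newScore := {
  ## `.I` holds this group's row numbers in `d`. Because `d` is keyed
  ## by (id, date), `.I[-1] - 1L` indexes each record's predecessor
  ## within the same `id`. The first record of a group has none, so it
  ## gets NA here: replace that with whatever your edge case calls for.
  days_since <- date - c(NA_integer_, d$date[.I[-1] - 1L])
  w <- exp(-days_since / decay)  # `decay` must already be defined
  ## ...
  ## Some other stuff you are doing here with `temp` which I can't
  ## follow ... then multiply the 'score' column for the given row
  ## by your correctly calculated weight `w` for that row
  ## (whatever it might be).
  w * score
}, by='id']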
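For a version that can be run end to end, here is a small sketch on made-up data. The values, the choice of `decay <- 7`, and leaving each id's first record as NA are all assumptions for illustration; it also stops at `w * score` and does not attempt the `temp` logic. It uses `diff()` inside the group instead of indexing with `.I`, which should give the same gaps when the table is keyed by id and date.

```r
library(data.table)

## Toy data (made up): integer dates, one score per record.
d <- data.table(id    = c(1L, 1L, 1L, 2L, 2L),
                date  = c(100L, 103L, 110L, 50L, 51L),
                score = c(1, 2, 3, 4, 5))
setkeyv(d, c("id", "date"))

decay <- 7  # assumed decay constant, in days

d[, newScore := {
  ## diff() only sees this group's dates, so nothing leaks across ids;
  ## the first record of each id has no predecessor and stays NA.
  days_since <- c(NA_integer_, diff(date))
  exp(-days_since / decay) * score
}, by = "id"]

d
```

The NA in the first row of each id is a placeholder: substitute whatever value the decay score should take when there is no earlier record.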

HTH,
-steve

Steve Lianoglou
Computational Biologist
Genentech
