
Thoughts for faster indexing

13 messages · Noah Silverman, jim holtman, Carl Witthoft +9 more

#
Hello,

I have a fairly large data.frame.  (About 150,000 rows of 100
variables.)  There are case IDs, and multiple entries for each ID, with a
date stamp.  (i.e. records of people's activity.)


I need to iterate over each person (record ID) in the data set, and then
process their data for each date.  The processing part is fast, the date
part is fast.  Locating the records is slow.  I've even tried using
data.table, with ID set as the index, and it is still slow.

The line with the slow process (According to Rprof) is:


j <- which( d$id == person )

(I then process all the records indexed by j, which seems fast enough.)

where d is my data.frame or data.table

I thought that using the data.table indexing would speed things up, but
not in this case.

Any ideas on how to speed this up?


Thanks!
#
You need to show the statement in context with the rest of the script.  You need to tell us what you want to do, not how you want to do it.

Sent from my iPad
On Nov 20, 2013, at 15:16, Noah Silverman <noahsilverman at g.ucla.edu> wrote:

#
What the Data Munger Guru said.
Plus: this is almost certainly a job for ddply or data.table.
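
For example, a rough data.table sketch (toy data; the per-group sum is just a stand-in for whatever the real per-person computation is):

```r
library(data.table)

# toy stand-in for the real 150,000-row data.frame
d <- data.frame(id = rep(c("a", "b"), each = 3), x = 1:6)

dt <- as.data.table(d)
setkey(dt, id)

# one grouped pass over the table instead of one which() scan per person
res <- dt[, .(total = sum(x)), by = id]
```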



Noah Silverman-2 wrote

--
View this message in context: http://r.789695.n4.nabble.com/Thoughts-for-faster-indexing-tp4680854p4680889.html
Sent from the R help mailing list archive at Nabble.com.
#
On 11/21/13, 12:34 , Jim Holtman wrote:
Agreed - a few details will result in guesses (see my guess below)
Possibly use

d_by_id <- split(d, d$id)

which splits the data.frame d into a list, where each element is the
data.frame for one id.

But: Just a guess.
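
A toy example of the pattern (sum() stands in for the real per-id processing):

```r
# small stand-in data.frame
d <- data.frame(id = c("a", "a", "b", "b"), x = 1:4)

# partition the data.frame once, up front ...
d_by_id <- split(d, d$id)

# ... then loop over the ready-made pieces; no which() scan per person
results <- lapply(d_by_id, function(sub) sum(sub$x))
```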

Cheers,

Rainer
--
Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation
Biology, UCT), Dipl. Phys. (Germany)

Centre of Excellence for Invasion Biology
Stellenbosch University
South Africa
#
.... or use tapply(seq_len(nrow(d)), d$id, ...) or a wrapper version
thereof (by, aggregate, ...)
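
In other words (toy data; the index vectors j are what the original which() call was producing, but computed in one pass):

```r
d <- data.frame(id = c("a", "a", "b"), x = c(10, 20, 5))

# split the row numbers by id in one pass; each element is an index vector j
idx_by_id <- tapply(seq_len(nrow(d)), d$id, c)

# process each person's rows via the precomputed indices
res <- lapply(idx_by_id, function(j) sum(d$x[j]))
```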

However, it would not surprise me if this does not help. I suspect
that the problem is not what you think but in the code and context you
omitted, as others have already noted.

-- Bert
On Thu, Nov 21, 2013 at 5:51 AM, Rainer M Krug <Rainer at krugs.de> wrote:

#
I have some processes where I do the same thing: iterate over subsets of a
data frame.
My data frame has ~250,000 rows, 30 variables, and the subsets are such
that there are about 6000 of them.

Performing a which() statement like yours seems quite fast.

For example, wrapping unix.time() around the which() expression, I get

   user  system elapsed
  0.008   0.000   0.008

It's hard for me to imagine the single task of getting the indexes is slow
enough to be a bottleneck.



On the other hand, if the variable being used to identify subsets is a
factor with many levels (~6000 in my case), it is noticeably slower.

   user  system elapsed
  0.024   0.002   0.026


I haven't tested it, and have no real expectation that it will make a
difference, but perhaps sorting by the index variable before iterating
will help (if you haven't already). Since these are not true indexes in
the sense used by relational database systems, maybe it will make a
difference.
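
A base-R sketch of that idea (toy data; after sorting, rle() finds each id's contiguous block so no per-person scan is needed):

```r
d <- data.frame(id = c("b", "a", "b", "a"), x = 1:4)

# sort once so each person's rows are contiguous
d <- d[order(d$id), ]

# run-length encoding of the sorted ids gives every block's start and end
r <- rle(as.character(d$id))
ends <- cumsum(r$lengths)
starts <- ends - r$lengths + 1

# each person's rows are starts[k]:ends[k]
blocks <- Map(function(s, e) d$x[s:e], starts, ends)
```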
#
Hi,
On Nov 21, 2013, at 10:42 AM, "MacQueen, Don" <macqueen1 at llnl.gov> wrote:

You might also want to check this out:

http://adv-r.had.co.nz/Performance.html

Cheers,
Ben
Ben Tupper
Bigelow Laboratory for Ocean Sciences
60 Bigelow Drive, P.O. Box 380
East Boothbay, Maine 04544
http://www.bigelow.org
#
Using split() once (and using its output in a loop), instead of applying == to
a long vector many times, as in
   for(j in split(seq_along(d$id), d$id)) {
       # newdata[j,] <- process(data[j,])
   }
is typically faster.  But this is the sort of thing that tapply() and the functions
in package:plyr do for you.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
Neal Fultz <nfultz <at> gmail.com> writes:
You might also take a look at the 'dplyr' package on Github: it's
next-gen plyr, engineered for performance ...

https://github.com/hadley/dplyr
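
A minimal sketch with the dplyr API (toy data; the column names and the summary computation are assumptions):

```r
library(dplyr)

d <- data.frame(id = rep(c("a", "b"), each = 2), x = 1:4)

# grouped computation in a single pass, instead of one subset per person
res <- d %>%
  group_by(id) %>%
  summarise(total = sum(x))
```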
#
Not sure this helps but...
 
######
# data frame with 30,000 IDs, each with 5 "dates", plus some random data...
library(data.table)
df <- data.frame(id=rep(1:30000, each=5),
                 date=rep(1:5, times=30000),
                 x=rnorm(150000), y=rnorm(150000, mean=1), z=rnorm(150000, mean=3))
dt <- data.table(df, key="id")    # note you have to set the key...

# No difference when using which
system.time(for (i in 1:300) {j <- which(df$id==i)})
  user  system elapsed
  0.73    0.06    0.79

system.time(for (i in 1:300) {j <- which(dt$id==i)})
  user  system elapsed
  0.69    0.04    0.76

# 20 X faster using joins
system.time(for (i in 1:300) {select <- df[df$id==i,]})
  user  system elapsed
  19.25    0.36   19.64 
system.time(for (i in 1:300) {select <- dt[id==i,]})
  user  system elapsed
  4.32    0.11    4.45 
system.time(for (i in 1:300) {select <- dt[J(i)]})
  user  system elapsed
  0.88    0.00    0.88
######

Note that extracting select with a data.table join still took longer than generating an "index" using which(), but getting all the
columns in one step, instead of just the index, might speed up later operations.


#
Hi,
On Tue, Nov 26, 2013 at 11:41 AM, Noah Silverman <noahsilverman at ucla.edu> wrote:
A few quick ones.

You had said you tried data.table and found it to be slow still -- my
guess is that you might not have used it correctly, so here is a rough
sketch of what to do.

Let's assume that your date is converted to some integer -- I will
leave that exercise to you :-) -- but it seems like you just want to
calculate the number of (whole) days since an event that you have a record
for, so this should be (in principle) easy to do (if you really need the
full power of "date math", data.table supports that as well).

Also, you never "reset" your `temp` variable, so it looks like you are
carrying over `temp` from one `id` group to the next (and, while I
have no knowledge of your problem, I would imagine this is not what
you want to do).

Anyway some rough ideas to get you started:

R> d <- as.data.table(d)
R> setkeyv(d, c('id', 'date'))

Now the records within each id are ordered by date from first to last.

The specifics of your decay score escape me a bit, e.g. what is the
value of "days_since" for the first record of each id? I'll let you
figure that out, but in the non-edge cases, it looks like you can
calculate "days since" by subtracting the date recorded in the previous
record from the current date. (Note that `.I` is a special
data.table variable holding the row number of a given record in the
original data.table):

d[, newScore := {
  ## handle edge case for first record w/in each `id` group
  days_since <- date - d$date[.I - 1]
  w <- exp(-days_since / decay)
  ## ...
  ## Some other stuff you are doing here which I can't
  ## understand with temp ... then multiply the 'score' column
  ## for the given row by the correctly calculated weight `w`
  ## for that row (whatever it might be).
  w * score
}, by='id']
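
If the `.I` edge case is a nuisance, here is a variant that stays inside each group (toy numbers; the `decay` value and NA for each id's first record are assumptions about the desired behavior):

```r
library(data.table)

decay <- 7  # hypothetical decay constant
d <- data.table(id    = c("a", "a", "a", "b", "b"),
                date  = c(1L, 3L, 10L, 2L, 4L),
                score = 1)
setkeyv(d, c("id", "date"))

# diff() is computed per group, so nothing carries over between ids;
# the first record of every id gets NA by construction
d[, days_since := c(NA, diff(date)), by = id]
d[, w := exp(-days_since / decay)]
d[, newScore := w * score]
```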

HTH,
-steve