
Thoughts for faster indexing

I have some processes that do the same thing: iterate over subsets of a
data frame.
My data frame has ~250,000 rows and 30 variables, and there are about
6000 subsets.

Performing a which() statement like yours seems quite fast.

For example, wrapping system.time() (unix.time() in older versions of R)
around the which() expression, I get

   user  system elapsed
  0.008   0.000   0.008

It's hard for me to imagine the single task of getting the indexes is slow
enough to be a bottleneck.
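
A rough sketch of how that timing could be reproduced. The data here are
made up to match the sizes I described (250,000 rows, about 6000 subsets);
the column names id and x and the subset value 42 are placeholders, not
taken from my real data, and the timings will differ by machine.

```r
# Hypothetical data: 250,000 rows with an integer key taking ~6000 values.
set.seed(1)
n_rows   <- 250000L
n_groups <- 6000L
df <- data.frame(id = sample.int(n_groups, n_rows, replace = TRUE),
                 x  = rnorm(n_rows))

# Time a single index lookup (unix.time() is an older alias of system.time()).
t_int <- system.time(idx <- which(df$id == 42L))
print(t_int)

# idx holds the row numbers of one subset; use it to extract those rows.
sub <- df[idx, ]
```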



On the other hand, if the variable being used to identify subsets is a
factor with many levels (~6000 in my case), it is noticeably slower.

   user  system elapsed
  0.024   0.002   0.026
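
To compare the two cases side by side, something like the following could
be used. Again the data are invented (id, id_f and the value 42 are
placeholders), and this only shows how to take the measurement; it makes
no claim about what numbers you will see.

```r
# Hypothetical data: the same key stored as an integer and as a factor.
set.seed(1)
n_rows   <- 250000L
n_groups <- 6000L
id   <- sample.int(n_groups, n_rows, replace = TRUE)
df   <- data.frame(id = id, id_f = factor(id))

# Lookup on the integer column.
t_int <- system.time(idx_int <- which(df$id == 42L))

# Lookup on the factor column; the comparison is against the level label.
t_fac <- system.time(idx_fac <- which(df$id_f == "42"))

print(t_int)
print(t_fac)
```

Both calls return the same row numbers, so if the factor version turns out
to be the slow part, keeping an integer or character copy of the key for
lookups is one option.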


I haven't tested it, and have no real expectation that it will make a
difference, but perhaps sorting by the index variable before iterating
will help (if you haven't already). These are not true indexes in the
sense used by relational database systems, so each lookup scans the
whole column, and row order might matter.
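
If you want to try the sorting idea, a minimal test could look like this.
It uses the same kind of made-up data as before (id, x and the choice of
200 subsets are arbitrary), and I have not run it against real data, so
treat it as a way to measure rather than as evidence either way.

```r
# Hypothetical data: 250,000 rows, ~6000 distinct key values, random order.
set.seed(1)
df <- data.frame(id = sample.int(6000L, 250000L, replace = TRUE),
                 x  = rnorm(250000L))

# Same rows, sorted by the key used to pick subsets.
df_sorted <- df[order(df$id), ]

# Time the index lookup over a sample of subsets, unsorted vs. sorted.
ids <- unique(df$id)[1:200]
t_unsorted <- system.time(for (i in ids) which(df$id == i))
t_sorted   <- system.time(for (i in ids) which(df_sorted$id == i))

print(t_unsorted)
print(t_sorted)
```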