Very slow subsetting by name
On 07/15/2010 08:38 AM, Martin Morgan wrote:
On 07/15/2010 01:12 AM, Herv? Pag?s wrote:
Hi,
I'm subsetting a named vector using character indices.
My vector of indices (or keys) is 10x longer than the vector
I'm subsetting. All my keys are distinct and only 10% of them
are valid (i.e. match a name of the vector being subsetted).
It is surprisingly slow:
x1 <- 1:1000
names(x1) <- paste("a", x1, sep="")
keys <- sample(c(names(x1), paste("b", 1:9000, sep="")))
system.time(y1 <- x1[keys])
user system elapsed
0.410 0.000 0.416
x2 <- 1:2000
names(x2) <- paste("a", x2, sep="")
keys <- sample(c(names(x2), paste("b", 1:18000, sep="")))
system.time(y2 <- x2[keys])
user system elapsed 1.730 0.000 1.736
For what its worth, I think this comes about in the loop starting at subscript.c:538, which seems to be there to allow [<-,*,character to extend a vector with new named elements
x=c(a=1) x["b"] = 2 x
a b 1 2 It seems to be irrelevant (?) for sub-setting per se (though by analogy one might expect x["c"] to return a length-1 vector NA with name "c", whereas it returns a vector with names NA). Seems like the O(n^2) loop through NonNullStringMatch could be replaced by look-ups into a hash, or an additional argument could be propagated
this passes make check and does
x4 <- 1:8000
names(x4) <- paste("a", x4, sep="")
keys <- sample(c(names(x4), paste("b", 1:72000, sep="")))
system.time(y4 <- x4[keys])
user system elapsed 0.092 0.000 0.093
identical(y4, x4[match(keys, names(x4))])
[1] TRUE
but uses some additional memory.
Martin
Index: src/main/subscript.c
===================================================================
--- src/main/subscript.c (revision 52526)
+++ src/main/subscript.c (working copy)
@@ -535,15 +535,17 @@
}
+ SEXP sindx = PROTECT(match(s, s, 0)); /* first match */
for (i = 0; i < ns; i++) {
sub = INTEGER(indx)[i];
if (sub == 0) {
- for (j = 0 ; j < i ; j++)
- if (NonNullStringMatch(STRING_ELT(s, i), STRING_ELT(s, j))) {
- sub = INTEGER(indx)[j];
- SET_VECTOR_ELT(indexnames, i, STRING_ELT(s, j));
- break;
- }
+ j = INTEGER(sindx)[i] - 1;
+ if (NA_STRING != STRING_ELT(s, j) &&
+ R_NilValue != STRING_ELT(s, j))
+ {
+ sub = INTEGER(indx)[j];
+ SET_VECTOR_ELT(indexnames, i, STRING_ELT(s, j));
+ }
}
if (sub == 0) {
if (!canstretch) {
@@ -561,7 +563,7 @@
setAttrib(indx, R_UseNamesSymbol, indexnames);
if (canstretch)
*stretch = extra;
- UNPROTECT(4);
+ UNPROTECT(5);
return indx;
}
to stringSubscript to exit early when names aren't required. Or the call to makeSubscript at subset.c:164 could instead be made to matchE in unique.c. Martin
x3 <- 1:4000
names(x3) <- paste("a", x3, sep="")
keys <- sample(c(names(x3), paste("b", 1:36000, sep="")))
system.time(y3 <- x3[keys])
user system elapsed
8.900 0.010 9.227
x4 <- 1:8000
names(x4) <- paste("a", x4, sep="")
keys <- sample(c(names(x4), paste("b", 1:72000, sep="")))
system.time(y4 <- x4[keys])
user system elapsed 130.390 0.000 132.316 And it's apparently worse than quadratic in time! I'm wondering why this subsetting by name is so slow since it seems it could be implemented with x4[match(keys, names(x4))], which is very fast: only 0.012s! This is with R-2.11.0 and R-2.12.0. Thanks, H.
Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793