Dear all,
I have seen that others have discussed the partial matching behaviour of
data.frame[idx,] in the past, in particular with respect to unexpected
results sets.
I am aware of the fact that one can work around this using either
match() or switching to tibble/data.table or similar altogether.
I have a different issue with the partial matching, in particular its
performance when used on large data frames or more specifically, with
large queries matched against its row names.
I came across a case where I wanted to extract data from a large table
(approx 1M rows) using an index which matched only about 50% to the row
names, i.e. about 50% row name hits and 50% misses.
What was unexpected is that in this case was that [.data.frame was
hanging for a long time (I waited about 10 minutes and then restarted
R). Also, this cannot be interrupted in interactive mode.
ids <- paste0("cg", sprintf("%06d",0:(1e6-1)))
d1 <- data.frame(row.names=ids, v=1:(1e6) )
q1 <- sample(ids, 1e6, replace=F)
system.time({r <- d1[q1,,drop=F]})
#?? user? system elapsed
#? 0.464?? 0.000?? 0.465
# those will hang a long time, I stopped R after 10 minutes
q2 <- c(q1[1:5e5], gsub("cg", "ct", q1[(5e5+1):1e6]) )
system.time({r <- d1[q2,,drop=F]})
# same here
q3 <- c(q1[1:5e5], rep("FOO",5e5) )
system.time({r <- d1[q3,,drop=F]})
It seems that the penalty of partial matching the non-hits across the
whole row name vector is not negligible any more with large tables and
queries, compared to small and medium tables.
I checked and pmatch(q2, rownames(d1) is equally slow.
Is there a chance to a) document this in the help page ("with large
indexes/tables use match()") or even better b) add an exact flag to
[.data.frame ?
Thanks a lot!
Best regards
Hilmar
Partial matching performance in data frame rownames using [
7 messages · Ivan Krylov, Hilmar Berger, Toby Hocking
? Mon, 11 Dec 2023 21:11:48 +0100 Hilmar Berger via R-devel <r-devel at r-project.org> ?????:
What was unexpected is that in this case was that [.data.frame was hanging for a long time (I waited about 10 minutes and then restarted R). Also, this cannot be interrupted in interactive mode.
That's unfortunate. If an operation takes a long time, it ought to be
interruptible. Here's a patch that passes make check-devel:
--- src/main/unique.c (revision 85667)
+++ src/main/unique.c (working copy)
@@ -1631,6 +1631,7 @@
}
}
+ unsigned int ic = 9999;
if(nexact < n_input) {
/* Second pass, partial matching */
for (R_xlen_t i = 0; i < n_input; i++) {
@@ -1642,6 +1643,10 @@
mtch = 0;
mtch_count = 0;
for (int j = 0; j < n_target; j++) {
+ if (!--ic) {
+ R_CheckUserInterrupt();
+ ic = 9999;
+ }
if (no_dups && used[j]) continue;
if (strncmp(ss, tar[j], temp) == 0) {
mtch = j + 1;
Best regards, Ivan
Dear Ivan, thanks a lot, that is helpful. Still, I feel that default partial matching cripples the functionality of data.frame for larger tables. Thanks again and best regards Hilmar
On 12.12.23 13:55, Ivan Krylov wrote:
? Mon, 11 Dec 2023 21:11:48 +0100 Hilmar Berger via R-devel <r-devel at r-project.org> ?????:
What was unexpected is that in this case was that [.data.frame was hanging for a long time (I waited about 10 minutes and then restarted R). Also, this cannot be interrupted in interactive mode.
That's unfortunate. If an operation takes a long time, it ought to be
interruptible. Here's a patch that passes make check-devel:
--- src/main/unique.c (revision 85667)
+++ src/main/unique.c (working copy)
@@ -1631,6 +1631,7 @@
}
}
+ unsigned int ic = 9999;
if(nexact < n_input) {
/* Second pass, partial matching */
for (R_xlen_t i = 0; i < n_input; i++) {
@@ -1642,6 +1643,10 @@
mtch = 0;
mtch_count = 0;
for (int j = 0; j < n_target; j++) {
+ if (!--ic) {
+ R_CheckUserInterrupt();
+ ic = 9999;
+ }
if (no_dups && used[j]) continue;
if (strncmp(ss, tar[j], temp) == 0) {
mtch = j + 1;
3 days later
On Wed, 13 Dec 2023 09:04:18 +0100
Hilmar Berger via R-devel <r-devel at r-project.org> wrote:
Still, I feel that default partial matching cripples the functionality of data.frame for larger tables.
Changing the default now would require a long deprecation cycle to give everyone who uses `[.data.frame` and relies on partial matching (whether they know it or not) enough time to adjust. Still, adding an argument feels like a small change: edit https://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R and add a condition before calling pmatch(). Adjust the warning() for named arguments. Don't forget to document the new argument in the man page at https://svn.r-project.org/R/trunk/src/library/base/man/Extract.data.frame.Rd Index: src/library/base/R/dataframe.R =================================================================== --- src/library/base/R/dataframe.R (revision 85664) +++ src/library/base/R/dataframe.R (working copy) @@ -591,14 +591,14 @@ ### These are a little less general than S `[.data.frame` <- - function(x, i, j, drop = if(missing(i)) TRUE else length(cols) == 1) + function(x, i, j, drop = if(missing(i)) TRUE else length(cols) == 1, pmatch.rows = TRUE) { mdrop <- missing(drop) Narg <- nargs() - !mdrop # number of arg from x,i,j that were specified has.j <- !missing(j) - if(!all(names(sys.call()) %in% c("", "drop")) + if(!all(names(sys.call()) %in% c("", "drop", "pmatch.rows")) && !isS4(x)) # at least don't warn for callNextMethod! - warning("named arguments other than 'drop' are discouraged") + warning("named arguments other than 'drop', 'pmatch.rows' are discouraged") if(Narg < 3L) { # list-like indexing or matrix indexing if(!mdrop) warning("'drop' argument will be ignored") @@ -679,7 +679,11 @@ ## for consistency with [, <length-1>] if(is.character(i)) { rows <- attr(xx, "row.names") - i <- pmatch(i, rows, duplicates.ok = TRUE) + i <- if (pmatch.rows) { + pmatch(i, rows, duplicates.ok = TRUE) + } else { + match(i, rows) + } } ## need to figure which col was selected: ## cannot use .subset2 directly as that may @@ -699,7 +703,11 @@ # as this can be expensive. if(is.character(i)) { rows <- attr(xx, "row.names") - i <- pmatch(i, rows, duplicates.ok = TRUE) + i <- if (pmatch.rows) { + pmatch(i, rows, duplicates.ok = TRUE) + } else { + match(i, rows) + } } for(j in seq_along(x)) { xj <- xx[[ sxx[j] ]] Index: src/library/base/man/Extract.data.frame.Rd =================================================================== --- src/library/base/man/Extract.data.frame.Rd (revision 85664) +++ src/library/base/man/Extract.data.frame.Rd (working copy) @@ -15,7 +15,7 @@ Extract or replace subsets of data frames. } \usage{ -\method{[}{data.frame}(x, i, j, drop = ) +\method{[}{data.frame}(x, i, j, drop =, pmatch.rows = TRUE) \method{[}{data.frame}(x, i, j) <- value \method{[[}{data.frame}(x, ..., exact = TRUE) \method{[[}{data.frame}(x, i, j) <- value @@ -45,6 +45,9 @@ column is selected.} \item{exact}{logical: see \code{\link{[}}, and applies to column names.} + + \item{pmatch.rows}{logical: whether to perform partial matching on + row names in case \code{i} is a character vector.} } \details{ Data frames can be indexed in several modes. When \code{[} and system.time({r <- d1[q2,, drop=FALSE, pmatch.rows = FALSE]}) # user system elapsed # 0.478 0.004 0.482 Unfortunately, that would be only the beginning. The prose in the whole ?`[.data.frame` would have to be updated; the new behaviour would have to be tested in tests/**.R. There may be very good reasons why named arguments to `[` other than drop= are discouraged for data.frames. I'm afraid I lack the whole-project view to consider whether such an addition would be safe.
Best regards, Ivan
3 days later
Hi Hilmar and Ivan, I have used your code examples to write a blog post about this topic, which has figures that show the asymptotic time complexity of the various approaches, https://tdhock.github.io/blog/2023/df-partial-match/ The asymptotic complexity of partial matching appears to be quadratic O(N^2) whereas the other approaches are asymptotically faster: linear O(N) or log-linear O(N log N). I think that accepting Ivan's pmatch.rows patch would add un-necessary complexity to base R, since base R already provides an efficient work-around, d1[match(q1,rownames(d1)),] I do think the CheckUserInterrupt patch is a good idea, though. Best, Toby
On Sat, Dec 16, 2023 at 2:49?AM Ivan Krylov <krylov.r00t at gmail.com> wrote:
On Wed, 13 Dec 2023 09:04:18 +0100 Hilmar Berger via R-devel <r-devel at r-project.org> wrote:
Still, I feel that default partial matching cripples the functionality of data.frame for larger tables.
Changing the default now would require a long deprecation cycle to give everyone who uses `[.data.frame` and relies on partial matching (whether they know it or not) enough time to adjust. Still, adding an argument feels like a small change: edit https://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R and add a condition before calling pmatch(). Adjust the warning() for named arguments. Don't forget to document the new argument in the man page at https://svn.r-project.org/R/trunk/src/library/base/man/Extract.data.frame.Rd Index: src/library/base/R/dataframe.R =================================================================== --- src/library/base/R/dataframe.R (revision 85664) +++ src/library/base/R/dataframe.R (working copy) @@ -591,14 +591,14 @@ ### These are a little less general than S `[.data.frame` <- - function(x, i, j, drop = if(missing(i)) TRUE else length(cols) == 1) + function(x, i, j, drop = if(missing(i)) TRUE else length(cols) == 1, pmatch.rows = TRUE) { mdrop <- missing(drop) Narg <- nargs() - !mdrop # number of arg from x,i,j that were specified has.j <- !missing(j) - if(!all(names(sys.call()) %in% c("", "drop")) + if(!all(names(sys.call()) %in% c("", "drop", "pmatch.rows")) && !isS4(x)) # at least don't warn for callNextMethod! - warning("named arguments other than 'drop' are discouraged") + warning("named arguments other than 'drop', 'pmatch.rows' are discouraged") if(Narg < 3L) { # list-like indexing or matrix indexing if(!mdrop) warning("'drop' argument will be ignored") @@ -679,7 +679,11 @@ ## for consistency with [, <length-1>] if(is.character(i)) { rows <- attr(xx, "row.names") - i <- pmatch(i, rows, duplicates.ok = TRUE) + i <- if (pmatch.rows) { + pmatch(i, rows, duplicates.ok = TRUE) + } else { + match(i, rows) + } } ## need to figure which col was selected: ## cannot use .subset2 directly as that may @@ -699,7 +703,11 @@ # as this can be expensive. if(is.character(i)) { rows <- attr(xx, "row.names") - i <- pmatch(i, rows, duplicates.ok = TRUE) + i <- if (pmatch.rows) { + pmatch(i, rows, duplicates.ok = TRUE) + } else { + match(i, rows) + } } for(j in seq_along(x)) { xj <- xx[[ sxx[j] ]] Index: src/library/base/man/Extract.data.frame.Rd =================================================================== --- src/library/base/man/Extract.data.frame.Rd (revision 85664) +++ src/library/base/man/Extract.data.frame.Rd (working copy) @@ -15,7 +15,7 @@ Extract or replace subsets of data frames. } \usage{ -\method{[}{data.frame}(x, i, j, drop = ) +\method{[}{data.frame}(x, i, j, drop =, pmatch.rows = TRUE) \method{[}{data.frame}(x, i, j) <- value \method{[[}{data.frame}(x, ..., exact = TRUE) \method{[[}{data.frame}(x, i, j) <- value @@ -45,6 +45,9 @@ column is selected.} \item{exact}{logical: see \code{\link{[}}, and applies to column names.} + + \item{pmatch.rows}{logical: whether to perform partial matching on + row names in case \code{i} is a character vector.} } \details{ Data frames can be indexed in several modes. When \code{[} and system.time({r <- d1[q2,, drop=FALSE, pmatch.rows = FALSE]}) # user system elapsed # 0.478 0.004 0.482 Unfortunately, that would be only the beginning. The prose in the whole ?`[.data.frame` would have to be updated; the new behaviour would have to be tested in tests/**.R. There may be very good reasons why named arguments to `[` other than drop= are discouraged for data.frames. I'm afraid I lack the whole-project view to consider whether such an addition would be safe. -- Best regards, Ivan
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Hi Hilmar and Ivan, I have used your code examples to write a blog post about this topic, which has figures that show the asymptotic time complexity of the various approaches, https://tdhock.github.io/blog/2023/df-partial-match/ The asymptotic complexity of partial matching appears to be quadratic O(N^2) whereas the other approaches are asymptotically faster: linear O(N) or log-linear O(N log N). I think that accepting Ivan's pmatch.rows patch would add un-necessary complexity to base R, since base R already provides an efficient work-around, d1[match(q1,rownames(d1)),] I do think the CheckUserInterrupt patch is a good idea, though. Best, Toby
On Sat, Dec 16, 2023 at 2:49?AM Ivan Krylov <krylov.r00t at gmail.com> wrote:
On Wed, 13 Dec 2023 09:04:18 +0100 Hilmar Berger via R-devel <r-devel at r-project.org> wrote:
Still, I feel that default partial matching cripples the functionality of data.frame for larger tables.
Changing the default now would require a long deprecation cycle to give everyone who uses `[.data.frame` and relies on partial matching (whether they know it or not) enough time to adjust. Still, adding an argument feels like a small change: edit https://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R and add a condition before calling pmatch(). Adjust the warning() for named arguments. Don't forget to document the new argument in the man page at https://svn.r-project.org/R/trunk/src/library/base/man/Extract.data.frame.Rd Index: src/library/base/R/dataframe.R =================================================================== --- src/library/base/R/dataframe.R (revision 85664) +++ src/library/base/R/dataframe.R (working copy) @@ -591,14 +591,14 @@ ### These are a little less general than S `[.data.frame` <- - function(x, i, j, drop = if(missing(i)) TRUE else length(cols) == 1) + function(x, i, j, drop = if(missing(i)) TRUE else length(cols) == 1, pmatch.rows = TRUE) { mdrop <- missing(drop) Narg <- nargs() - !mdrop # number of arg from x,i,j that were specified has.j <- !missing(j) - if(!all(names(sys.call()) %in% c("", "drop")) + if(!all(names(sys.call()) %in% c("", "drop", "pmatch.rows")) && !isS4(x)) # at least don't warn for callNextMethod! - warning("named arguments other than 'drop' are discouraged") + warning("named arguments other than 'drop', 'pmatch.rows' are discouraged") if(Narg < 3L) { # list-like indexing or matrix indexing if(!mdrop) warning("'drop' argument will be ignored") @@ -679,7 +679,11 @@ ## for consistency with [, <length-1>] if(is.character(i)) { rows <- attr(xx, "row.names") - i <- pmatch(i, rows, duplicates.ok = TRUE) + i <- if (pmatch.rows) { + pmatch(i, rows, duplicates.ok = TRUE) + } else { + match(i, rows) + } } ## need to figure which col was selected: ## cannot use .subset2 directly as that may @@ -699,7 +703,11 @@ # as this can be expensive. if(is.character(i)) { rows <- attr(xx, "row.names") - i <- pmatch(i, rows, duplicates.ok = TRUE) + i <- if (pmatch.rows) { + pmatch(i, rows, duplicates.ok = TRUE) + } else { + match(i, rows) + } } for(j in seq_along(x)) { xj <- xx[[ sxx[j] ]] Index: src/library/base/man/Extract.data.frame.Rd =================================================================== --- src/library/base/man/Extract.data.frame.Rd (revision 85664) +++ src/library/base/man/Extract.data.frame.Rd (working copy) @@ -15,7 +15,7 @@ Extract or replace subsets of data frames. } \usage{ -\method{[}{data.frame}(x, i, j, drop = ) +\method{[}{data.frame}(x, i, j, drop =, pmatch.rows = TRUE) \method{[}{data.frame}(x, i, j) <- value \method{[[}{data.frame}(x, ..., exact = TRUE) \method{[[}{data.frame}(x, i, j) <- value @@ -45,6 +45,9 @@ column is selected.} \item{exact}{logical: see \code{\link{[}}, and applies to column names.} + + \item{pmatch.rows}{logical: whether to perform partial matching on + row names in case \code{i} is a character vector.} } \details{ Data frames can be indexed in several modes. When \code{[} and system.time({r <- d1[q2,, drop=FALSE, pmatch.rows = FALSE]}) # user system elapsed # 0.478 0.004 0.482 Unfortunately, that would be only the beginning. The prose in the whole ?`[.data.frame` would have to be updated; the new behaviour would have to be tested in tests/**.R. There may be very good reasons why named arguments to `[` other than drop= are discouraged for data.frames. I'm afraid I lack the whole-project view to consider whether such an addition would be safe. -- Best regards, Ivan
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
1 day later
Dear Toby and Ivan, thanks a lot for the proposed patch and this detailed analysis. The timing analysis nicely shows what I suspected - that partial matching in large tables (>>10^5 rows) can get prohibitively slow. For 10^6 rows with 50% non-hits in exact matching I roughly would expect 10,000 seconds, i.e. almost 3h. That would be quite slow even if one would want partial matching. My suspicion, however, is that most users do not want partial matching at all and use row name indexing using character vectors in the same way as applied in data.table or tibble, i.e. as a unique key to table rows. I can't remember a valid use case where I would have used partial matching for a rowname index in the last 10 years, and I would be happy to learn how widespread such use cases are. Regarding the workaround, I do not fully agree that adding match() to the call of [.data.frame() would be a preferable solution. In cases where one cannot exclude that the data.frame will grow large and that there might be considerable proportions of non-hits in exact matching, the workaround would have to applied always in order to achieve a predictable performance. Which, in my opinion, raises the question if and when the ordinary, partial matching option would be still applicable. I am not knowledgeable to say how much work it would be, but I believe that we could test the impact of Ivan's proposed solution by running CRAN/BioC package tests against a patched R compared to an unpatched one. I can offer to have a look at failing test cases to see if those are intentional or unintentional uses of partial matching. Best regards Hilmar
On 19.12.23 21:57, Toby Hocking wrote:
Hi Hilmar and Ivan, I have used your code examples to write a blog post about this topic, which has figures that show the asymptotic time complexity of the various approaches, https://tdhock.github.io/blog/2023/df-partial-match/ The asymptotic complexity of partial matching appears to be quadratic O(N^2) whereas the other approaches are asymptotically faster: linear O(N) or log-linear O(N log N). I think that accepting Ivan's pmatch.rows patch would add un-necessary complexity to base R, since base R already provides an efficient work-around, d1[match(q1,rownames(d1)),] I do think the CheckUserInterrupt patch is a good idea, though. Best, Toby On Sat, Dec 16, 2023 at 2:49?AM Ivan Krylov <krylov.r00t at gmail.com> wrote:
On Wed, 13 Dec 2023 09:04:18 +0100 Hilmar Berger via R-devel <r-devel at r-project.org> wrote:
Still, I feel that default partial matching cripples the functionality of data.frame for larger tables.
Changing the default now would require a long deprecation cycle to give everyone who uses `[.data.frame` and relies on partial matching (whether they know it or not) enough time to adjust. Still, adding an argument feels like a small change: edit https://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R and add a condition before calling pmatch(). Adjust the warning() for named arguments. Don't forget to document the new argument in the man page at https://svn.r-project.org/R/trunk/src/library/base/man/Extract.data.frame.Rd Index: src/library/base/R/dataframe.R =================================================================== --- src/library/base/R/dataframe.R (revision 85664) +++ src/library/base/R/dataframe.R (working copy) @@ -591,14 +591,14 @@ ### These are a little less general than S `[.data.frame` <- - function(x, i, j, drop = if(missing(i)) TRUE else length(cols) == 1) + function(x, i, j, drop = if(missing(i)) TRUE else length(cols) == 1, pmatch.rows = TRUE) { mdrop <- missing(drop) Narg <- nargs() - !mdrop # number of arg from x,i,j that were specified has.j <- !missing(j) - if(!all(names(sys.call()) %in% c("", "drop")) + if(!all(names(sys.call()) %in% c("", "drop", "pmatch.rows")) && !isS4(x)) # at least don't warn for callNextMethod! - warning("named arguments other than 'drop' are discouraged") + warning("named arguments other than 'drop', 'pmatch.rows' are discouraged") if(Narg < 3L) { # list-like indexing or matrix indexing if(!mdrop) warning("'drop' argument will be ignored") @@ -679,7 +679,11 @@ ## for consistency with [, <length-1>] if(is.character(i)) { rows <- attr(xx, "row.names") - i <- pmatch(i, rows, duplicates.ok = TRUE) + i <- if (pmatch.rows) { + pmatch(i, rows, duplicates.ok = TRUE) + } else { + match(i, rows) + } } ## need to figure which col was selected: ## cannot use .subset2 directly as that may @@ -699,7 +703,11 @@ # as this can be expensive. if(is.character(i)) { rows <- attr(xx, "row.names") - i <- pmatch(i, rows, duplicates.ok = TRUE) + i <- if (pmatch.rows) { + pmatch(i, rows, duplicates.ok = TRUE) + } else { + match(i, rows) + } } for(j in seq_along(x)) { xj <- xx[[ sxx[j] ]] Index: src/library/base/man/Extract.data.frame.Rd =================================================================== --- src/library/base/man/Extract.data.frame.Rd (revision 85664) +++ src/library/base/man/Extract.data.frame.Rd (working copy) @@ -15,7 +15,7 @@ Extract or replace subsets of data frames. } \usage{ -\method{[}{data.frame}(x, i, j, drop = ) +\method{[}{data.frame}(x, i, j, drop =, pmatch.rows = TRUE) \method{[}{data.frame}(x, i, j) <- value \method{[[}{data.frame}(x, ..., exact = TRUE) \method{[[}{data.frame}(x, i, j) <- value @@ -45,6 +45,9 @@ column is selected.} \item{exact}{logical: see \code{\link{[}}, and applies to column names.} + + \item{pmatch.rows}{logical: whether to perform partial matching on + row names in case \code{i} is a character vector.} } \details{ Data frames can be indexed in several modes. When \code{[} and system.time({r <- d1[q2,, drop=FALSE, pmatch.rows = FALSE]}) # user system elapsed # 0.478 0.004 0.482 Unfortunately, that would be only the beginning. The prose in the whole ?`[.data.frame` would have to be updated; the new behaviour would have to be tested in tests/**.R. There may be very good reasons why named arguments to `[` other than drop= are discouraged for data.frames. I'm afraid I lack the whole-project view to consider whether such an addition would be safe. -- Best regards, Ivan
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel