Good evening,
I recently have observed slow merges when combining multiple data frames
derived from DataFrame and base::data.frame. I observed that the index
column of intermediate tables was of class <AsIs> (automatically
converted from character). The problem occurred mainly when using the
sorted = T option in base::merge.
This can be traced to xtfrm.AsIs being more than 100 times slower than
the comparable function for character vectors.
x = paste0("A_", 1:1e5)
system.time({o <- xtfrm(x)})
#? user? system elapsed
#? 0.325?? 0.005?? 0.332
x <- I(x)
system.time({o <- xtfrm(x)}) # this calls xtfrm.AsIs
# user? system elapsed
# 26.153?? 0.016? 26.388
This can be finally traced to base::rank() (called from xtfrm.default),
where I found that
"NB: rank is not itself generic but xtfrm is, and rank(xtfrm(x), ....)
will have the desired result if there is a xtfrm method. Otherwise, rank
will make use of ==, >, is.na and extraction methods for classed
objects, possibly rather slowly. "
This *sounds* like the existence of xtfrm.AsIs should already be able to
accelerate the ranking, but this does not seem to work. xtfrm.AsIs does
not do anything for my case of class(x) == "AsIs" and just delegates to
xtfrm.default.
As a quick solution (and if there is no other fix), could we possibly
add a note to the help page of I() that sorting/ordering/ranking of AsIs
columns will be rather slow?
Thanks a lot!
Best regards
Hilmar
> sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 20.04.6 LTS
Matrix products: default
BLAS:?? /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3;?
LAPACK version 3.9.0
locale:
?[1] LC_CTYPE=en_US.UTF-8?????? LC_NUMERIC=C
?[3] LC_TIME=de_DE.UTF-8??????? LC_COLLATE=en_US.UTF-8
?[5] LC_MONETARY=de_DE.UTF-8??? LC_MESSAGES=en_US.UTF-8
?[7] LC_PAPER=de_DE.UTF-8?????? LC_NAME=C
?[9] LC_ADDRESS=C?????????????? LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Berlin
tzcode source: system (glibc)
attached base packages:
[1] stats???? graphics? grDevices utils???? datasets? methods base
loaded via a namespace (and not attached):
[1] compiler_4.4.1
xftrm is more than 100x slower for AsIs than for character vectors
6 messages · Hilmar Berger, Ivan Krylov, H B +1 more
1 day later
? Fri, 12 Jul 2024 17:35:19 +0200 Hilmar Berger via R-devel <r-devel at r-project.org> ?????:
This can be finally traced to base::rank() (called from xtfrm.default), where I found that "NB: rank is not itself generic but xtfrm is, and rank(xtfrm(x), ....) will have the desired result if there is a xtfrm method. Otherwise, rank will make use of ==, >, is.na and extraction methods for classed objects, possibly rather slowly. "
The problem is indeed that the vector reaches base::rank in both cases, but since it has a class, the function has to construct and evaluate a call to .gt every time it wants to compare two elements. xtfrm.AsIs even tries to remove the 'AsIs' class before continuing the method dispatch process:
if (length(cl <- class(x)) > 1) oldClass(x) <- cl[-1L]
It doesn't work in the (very contrived) case when 'AsIs' is not the
first class and it doesn't remove 'AsIs' as the only class (making
static int equal(...) take the slower branch). What's going to break if
we allow removing the class attribute altogether? This seems to speed
up xtfrm(I(x)) and survive LC_ALL=C.UTF-8 make check-devel:
Index: src/library/base/R/sort.R
===================================================================
--- src/library/base/R/sort.R (revision 86895)
+++ src/library/base/R/sort.R (working copy)
@@ -297,7 +297,8 @@
xtfrm.AsIs <- function(x)
{
- if(length(cl <- class(x)) > 1) oldClass(x) <- cl[-1L]
+ cl <- oldClass(x)
+ oldClass(x) <- cl[cl != 'AsIs']
NextMethod("xtfrm")
}
Best regards, Ivan
Dear Ivan, thanks for the confirmation and the proposed patch. I just wanted to add some notes regarding the relevance of this: base::merge using by.x=0 or by.y=0 (i.e. matching on row.names) will automatically add a column Row.names which is I(row.names(x)) to the corresponding input table (using I() since revision 39026 to avoid conversion of character to factor). When this column is used for sorting (sort=TRUE by default in merge; should happen at least if all.x=T or all.y=T), this will result in slower execution. xtfrm.AsIs is unchanged since its addition in r50992 (likely unrelated to the former). So I guess that this just went unnoticed since it will not cause problems on small data frames. Best regards Hilmar
1 day later
Dear all, actually, it is not clear to me why there is still a protection of the added Row.names column in merge using I(). This seems to stem from a time when R would automatically convert character vectors to factor in data.frame on insert. However, I can't reproduce this behaviour even in data.frames generated with stringsAsFactors = T in current versions of R. Maybe the I() inserted in r 39026 can be removed altogether? Best regards Hilmar
On 14.07.24 19:09, HB via R-devel wrote:
Dear Ivan, thanks for the confirmation and the proposed patch. I just wanted to add some notes regarding the relevance of this: base::merge using by.x=0 or by.y=0 (i.e. matching on row.names) will automatically add a column Row.names which is I(row.names(x)) to the corresponding input table (using I() since revision 39026 to avoid conversion of character to factor). When this column is used for sorting (sort=TRUE by default in merge; should happen at least if all.x=T or all.y=T), this will result in slower execution. xtfrm.AsIs is unchanged since its addition in r50992 (likely unrelated to the former). So I guess that this just went unnoticed since it will not cause problems on small data frames. Best regards Hilmar [[alternative HTML version deleted]]
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
2 days later
Ivan Krylov via R-devel writes:
Thanks: I just changed xtfrm.AsIs() as suggested. Best -k
? Fri, 12 Jul 2024 17:35:19 +0200 Hilmar Berger via R-devel <r-devel at r-project.org> ?????:
This can be finally traced to base::rank() (called from xtfrm.default), where I found that "NB: rank is not itself generic but xtfrm is, and rank(xtfrm(x), ....) will have the desired result if there is a xtfrm method. Otherwise, rank will make use of ==, >, is.na and extraction methods for classed objects, possibly rather slowly. "
The problem is indeed that the vector reaches base::rank in both cases, but since it has a class, the function has to construct and evaluate a call to .gt every time it wants to compare two elements.
xtfrm.AsIs even tries to remove the 'AsIs' class before continuing the method dispatch process:
if (length(cl <- class(x)) > 1) oldClass(x) <- cl[-1L]
It doesn't work in the (very contrived) case when 'AsIs' is not the first class and it doesn't remove 'AsIs' as the only class (making static int equal(...) take the slower branch). What's going to break if we allow removing the class attribute altogether? This seems to speed up xtfrm(I(x)) and survive LC_ALL=C.UTF-8 make check-devel:
Index: src/library/base/R/sort.R =================================================================== --- src/library/base/R/sort.R (revision 86895) +++ src/library/base/R/sort.R (working copy) @@ -297,7 +297,8 @@
xtfrm.AsIs <- function(x)
{
- if(length(cl <- class(x)) > 1) oldClass(x) <- cl[-1L]
+ cl <- oldClass(x)
+ oldClass(x) <- cl[cl != 'AsIs']
NextMethod("xtfrm")
}
-- Best regards, Ivan
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Hilmar Berger via R-devel writes:
Thanks. I just removed the I() as suggested. Best -k
Dear all, actually, it is not clear to me why there is still a protection of the added Row.names column in merge using I(). This seems to stem from a time when R would automatically convert character vectors to factor in data.frame on insert. However, I can't reproduce this behaviour even in data.frames generated with stringsAsFactors = T in current versions of R. Maybe the I() inserted in r 39026 can be removed altogether?
Best regards
Hilmar
On 14.07.24 19:09, HB via R-devel wrote:
Dear Ivan, thanks for the confirmation and the proposed patch. I just wanted to add some notes regarding the relevance of this: base::merge using by.x=0 or by.y=0 (i.e. matching on row.names) will automatically add a column Row.names which is I(row.names(x)) to the corresponding input table (using I() since revision 39026 to avoid conversion of character to factor). When this column is used for sorting (sort=TRUE by default in merge; should happen at least if all.x=T or all.y=T), this will result in slower execution. xtfrm.AsIs is unchanged since its addition in r50992 (likely unrelated to the former). So I guess that this just went unnoticed since it will not cause problems on small data frames. Best regards Hilmar [[alternative HTML version deleted]]
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel