# Timing is all on my local machine (OSX)
N_v <- sample(c(1,0), 10^7, replace = TRUE)
L_v <- sample(c(TRUE, FALSE), 10^7, replace = TRUE)
# user system elapsed
system.time(table(N_v)) # 2.155 0.039 2.192
system.time(table(L_v)) # 0.806 0.030 0.838
system.time(N_fv <- as.factor(N_v)) # 2.026 0.024 2.050
system.time(L_fv <- as.factor(L_v)) # 0.668 0.015 0.683
system.time(table(N_fv)) # 0.133 0.022 0.156
system.time(table(L_fv)) # 0.134 0.018 0.151
The performance for integers and especially booleans is quite surprising.
Of note is that the performance is significantly better when using `tabulate`, since it doesn't involve a conversion to factor (though the input must be numeric or a factor, the results aren't named, and NA handling is worse). If you have performance-critical calls like this, you could consider using `tabulate` instead.
system.time(tabulate(N_v)) # 0.054 0.002 0.056
system.time(tabulate(as.integer(L_v))) # 0.052 0.002 0.055
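To make those trade-offs concrete, here is a tiny illustration (values chosen arbitrarily) of what `tabulate` gives you compared to `table`:

## tabulate() counts occurrences of 1..nbins; output is positional and unnamed,
## and NA / out-of-range values are silently dropped
x <- c(1L, 2L, 2L, 5L)
tabulate(x)  # 1 2 0 0 1 -- one slot per bin, no names
table(x)     # named counts for "1", "2", "5"; empty bins omitted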
I don't know if this is a known issue or not; most of my colleagues are aware of the slow-down and use `tabulate` when performance is required. My understanding was that the slower performance is a trade-off for more consistent behavior (better output, better handling of ambiguities/NA values, etc.), and that speed isn't the highest priority for `table`. Maybe someone else has a better understanding of the history of the function.
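For example (just illustrating the NA point, not a benchmark):

## table() can report NAs explicitly; tabulate() silently ignores them
y <- c(1, 2, 2, NA)
table(y, useNA = "ifany")  # includes a <NA> category
tabulate(y)                # returns 1 2; the NA is dropped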
As for improving the speed, it would basically come down to refactoring `table` to not use a `factor` conversion. I'd be concerned about introducing a lot of edge cases with that, but it's theoretically possible. Based on 30 seconds of thinking, it may be possible to do something like:
## just a sketch of a barebones non-factor implementation
test_tab <- function(x){
  lookup <- unique(x)
  counts <- tabulate(match(x, lookup))
  names(counts) <- as.character(lookup)
  counts
}
system.time(test_tab(L_v)) # 0.101 0.006 0.107
system.time(test_tab(N_v)) # 0.129 0.015 0.144
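As a quick sanity check (not a formal test), the counts from this sketch agree with `table` for these inputs; only the ordering and attributes differ:

## counts match table(); names come from unique() order rather than sorted levels
tab_new <- test_tab(L_v)
tab_old <- table(L_v)
all(tab_new[names(tab_old)] == as.vector(tab_old))  # should be TRUE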
This is also faster in the case where there are lots of categories with few entries per category:
N_v2 <- 1:1e7
system.time(test_tab(N_v2)) # 0.383 0.024 0.411
system.time(table(N_v2)) # 6.122 0.228 6.398
Obviously there are some big shortcomings:
- it's missing a lot of error checking etc. that the standard `table` has
- it only works with 1D vectors
- NA handling isn't quite the same as `table` (though it would be easy to adapt; a rough sketch follows below)
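To illustrate that last bullet, one possible (hypothetical) adaptation would be to drop or keep NAs explicitly before counting, roughly mirroring `table`'s default and its `useNA = "ifany"` behavior:

## sketch only: drop NAs by default (like table()), or count them as their own
## category when keep_na = TRUE (similar in spirit to useNA = "ifany")
test_tab_na <- function(x, keep_na = FALSE){
  if(!keep_na) x <- x[!is.na(x)]
  lookup <- unique(x)
  counts <- tabulate(match(x, lookup))
  names(counts) <- as.character(lookup)
  counts
}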
Just including this to potentially start a discussion on optimization.
For reference, the relevant section is in src/library/base/R/table.R:L75-85
-Aidan
-----------------------
Aidan Lakshman (he/him)
http://www.ahl27.com/
On 21 Mar 2025, at 8:26, Karolis Koncevičius wrote:
I was calling table() on some long logical vectors and noticed that it took a long time.
Out of curiosity I checked the performance of table() on different types, and had some unexpected results:
C <- sample(c("yes", "no"), 10^7, replace = TRUE)
F <- factor(sample(c("yes", "no"), 10^7, replace = TRUE))
N <- sample(c(1,0), 10^7, replace = TRUE)
I <- sample(c(1L,0L), 10^7, replace = TRUE)
L <- sample(c(TRUE, FALSE), 10^7, replace = TRUE)
# ordered by execution time
# user system elapsed
system.time(table(F)) # 0.088 0.006 0.093
system.time(table(C)) # 0.208 0.017 0.224
system.time(table(I)) # 0.242 0.019 0.261
system.time(table(L)) # 0.665 0.015 0.680
system.time(table(N)) # 1.771 0.019 1.791
The performance for integers and especially booleans is quite surprising.
After investigating the source of table(), I found the reason to be `as.character()`:
system.time(as.character(L))
user system elapsed
0.461 0.002 0.462
Even a manual conversion can achieve a speed-up by a factor of ~7:
system.time(c("FALSE", "TRUE")[L+1])
user system elapsed
0.061 0.006 0.067
Tested on R 4.4.3 as well as the devel trunk.
Just reporting for comments and attention.
Karolis K.