
help: program efficiency

If the input vector t is known to be ordered
(or if you only care about runs of duplicated
values, not all duplicated values), the following
is pretty quick:

nodup3 <- function (t) { 
    t + (sequence(rle(t)$lengths) - 1)/100
}
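
To see why this works, here is a small sketch of the intermediate
steps. The input values are made up for illustration; only rle()
and sequence() from base R are used.

```r
# Made-up sorted input with runs of duplicates.
x <- c(1, 1, 1, 2, 5, 5)

runs <- rle(x)$lengths        # length of each run: 3 1 2
idx  <- sequence(runs)        # position within each run: 1 2 3 1 1 2
out  <- x + (idx - 1) / 100   # first of each run unchanged, then +0.01, +0.02, ...

print(out)
```

Note that rle() only sees adjacent repeats, so on an input like
c(1, 2, 1) every run has length 1 and nothing gets offset.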

If you don't know whether the input will be ordered,
then ave() will do it a bit faster than your
code:

nodup2 <- function (t) { 
    ave(t, t, FUN = function(x) x + (seq_along(x) - 1)/100)
}
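
A small sketch of what ave() does here on unordered input (the
values are made up): it groups the positions of t by value, applies
FUN to each group, and puts the results back in the original
positions.

```r
# Made-up unordered input; 3 occurs three times, 1 occurs twice.
x <- c(3, 1, 3, 2, 1, 3)

# Within each group of equal values, the k-th occurrence gets (k - 1)/100 added.
out <- ave(x, x, FUN = function(v) v + (seq_along(v) - 1) / 100)

print(out)   # 3.00 1.00 3.01 2.00 1.01 3.02
```

Unlike the rle() version, duplicates do not have to be adjacent
to be picked up.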

E.g., for a sorted sequence of 300,000 numbers drawn with
replacement from 1:100,000 I get:

   user  system elapsed 
   2.78    0.05    3.97 
   user  system elapsed 
   1.83    0.02    2.66 
   user  system elapsed 
   0.18    0.00    0.14 
[1] TRUE
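
A comparison along those lines could be set up roughly as follows.
This is only a sketch: the seed, the variable names, and the use of
all.equal() for the final check are my assumptions, not the exact
commands that produced the timings above, and the numbers will
differ by machine and R version.

```r
# The two functions from this message, repeated so the sketch runs on its own.
nodup2 <- function(t) ave(t, t, FUN = function(x) x + (seq_along(x) - 1)/100)
nodup3 <- function(t) t + (sequence(rle(t)$lengths) - 1)/100

set.seed(1)  # arbitrary seed, only for reproducibility
# Sorted sample of 300,000 values drawn with replacement from 1:100,000.
x <- sort(sample(1:100000, 300000, replace = TRUE))

print(system.time(r2 <- nodup2(x)))
print(system.time(r3 <- nodup3(x)))

# On sorted input the two approaches should agree.
print(isTRUE(all.equal(r2, r3)))
```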

If speed is truly an issue, the built-in sequence may
be replaced by a faster one that does the same thing:

nodup3a <- function (t) {
    faster.sequence <- function(nvec) {
        seq_len(sum(nvec)) - rep(cumsum(c(0L, nvec[-length(nvec)])), 
            nvec)
    }
    t + (faster.sequence(rle(t)$lengths) - 1)/100
}

That took 0.05 seconds on the a2 dataset and produced
identical results.
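
As a quick check of how faster.sequence gets the same answer as
sequence(), here is a sketch on a tiny, made-up vector of run
lengths. It numbers all elements 1..sum(nvec) in one go and then
subtracts, for each run, the number of elements that came before it.

```r
faster.sequence <- function(nvec) {
    seq_len(sum(nvec)) - rep(cumsum(c(0L, nvec[-length(nvec)])), nvec)
}

nvec <- c(3L, 1L, 2L)                           # made-up run lengths

offsets <- cumsum(c(0L, nvec[-length(nvec)]))   # elements before each run: 0 3 4
print(rep(offsets, nvec))                       # 0 0 0 3 4 4
print(faster.sequence(nvec))                    # 1 2 3 1 1 2

# Same values as the built-in.
same <- isTRUE(all.equal(faster.sequence(nvec), sequence(nvec)))
print(same)
```

The gain comes from avoiding a separate sequence per run; everything
is done with a few vectorized calls over the whole input.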

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com