How to erase (replace) certain elements in the data.frame?

Hi Sergey,

This is not an answer to your exact question, but can you use a
matrix?  If you can use a matrix instead of a data frame, you should
get a considerable performance boost.  Even for very large matrices
(at least on my system), it is fast enough that I find it hard to
believe it is a bottleneck in the overall imputation process.  For
example, for a 1000 by 100 object as a data frame:

   user  system elapsed
   1.09    0.02    1.12

and as a matrix:

   user  system elapsed
   0.02    0.00    0.01
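
As a rough sketch of how a comparison like this could be timed: the
helper blank_rows() below is a hypothetical stand-in I made up for
illustration, not Sergey's original function, and the sizes, seed and
percentages are assumptions.  Timings will differ from machine to
machine.

```r
# Hypothetical helper (not from the thread): set k randomly chosen
# cells to NA in every row after the first n.keeprows rows.
blank_rows <- function(x, n.keeprows, del.percent) {
  n.items <- ncol(x)
  k <- as.integer(n.items * del.percent / 100)
  # Rows n.keeprows + 1 .. nrow(x); empty if there is nothing to blank.
  for (i in seq_len(nrow(x) - n.keeprows) + n.keeprows) {
    j <- sample.int(n.items, k)
    x[i, j] <- NA
  }
  x
}

set.seed(1)
m <- matrix(rnorm(1000 * 100), nrow = 1000, ncol = 100)  # matrix version
d <- as.data.frame(m)                                    # same data as a data frame

# Time the identical operation on both representations.
t_df  <- system.time(d2 <- blank_rows(d, 10, 20))
t_mat <- system.time(m2 <- blank_rows(m, 10, 20))
print(t_df)
print(t_mat)
```

The only thing that changes between the two calls is the class of x,
so any difference in elapsed time comes from the data frame assignment
method rather than from the sampling.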

Beyond that, for very large objects, the revision below gives a slight
performance increase (around 5 seconds for a 1 million row by 100
column object on my system).  That gain is small for matrices and
completely dwarfed by other bottlenecks for data frames, and it comes
at the cost of readability/flexibility:

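rdel <- function (x, n.keeprows, del.percent){
  n.items <- ncol(x)
  # number of cells to blank out per row
  k <- as.integer(n.items * del.percent / 100)
  cols <- 1:n.items
  lcols <- length(cols)
  # leave the first n.keeprows rows untouched
  for (i in (n.keeprows+1):nrow(x)){
    # call the internal sampler directly to skip the argument checks
    # in sample(); sample.int(lcols, k) is the supported equivalent
    j <- cols[.Internal(sample(lcols, k, FALSE, NULL))]
    x[i,j] <- NA
  }
  return(x)
}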

If you must use a data frame, you can gain some performance increase
(for a 10000 by 100 data frame, it takes about 30 seconds on my system
versus 40 for your original function) by using:

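random.del2 <- function (x, n.keeprows, del.percent){
  n.items <- ncol(x)
  # number of cells to blank out per row
  k <- as.integer(n.items*(del.percent/100))
  for (i in (n.keeprows+1):nrow(x)){
    j <- sample(1:n.items, k)
    # call the data frame method directly; it returns a modified copy,
    # so the result has to be assigned back to x
    x <- `[<-.data.frame`(x, i, j, NA)
  }
  return(x)
}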

which basically just saves R the trouble of method dispatch, i.e., of
figuring out which assignment method to use.  Of course the problem is
that your function becomes extremely specialized.  If you pass anything
to it but a data frame, good things will not happen.
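
To make the dispatch point concrete, here is a small sketch on a toy
data frame (the data and the names d, d1 and d2 are made up for
illustration).  It compares the usual replacement syntax with a direct
call to the data frame method:

```r
d <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# Usual form: R looks at class(d) and dispatches to `[<-.data.frame`.
d1 <- d
d1[2, c(1, 3)] <- NA

# Direct call: no dispatch.  The method returns a modified copy, so
# the result has to be captured.
d2 <- `[<-.data.frame`(d, 2, c(1, 3), NA)

# Both routes should give the same data frame.
print(identical(d1, d2))

# The direct call did not change its argument in place.
print(anyNA(d))
```

Because the direct call returns a copy rather than modifying its
argument, the result must be assigned somewhere if the change is to be
kept.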

Cheers,

Josh
On Sat, Apr 23, 2011 at 5:37 PM, sneaffer <sneaffer at mail.ru> wrote: