Filling out a data frame row by row.... slow!

If you must repeatedly append rows to a data.frame,
try storing the dataset you are filling in as a set
of independent vectors, perhaps in a new environment
to keep things organized, and expand them all at the same time.
At the very end, make a data.frame out of those vectors.
E.g., change the likes of

f0 <- function (nRow) 
{
    incrSize <- 10000
    curSize <- 10000
    data <- data.frame(x = numeric(curSize), y = numeric(curSize), 
        z = numeric(curSize))
    for (i in seq_len(nRow)) {
        if (i > curSize) {
            data <- rbind(data, data.frame(x = numeric(incrSize), 
                y = numeric(incrSize), z = numeric(incrSize)))
            curSize <- nrow(data)
        }
        data[i, ] <- c(i + 0.1, i + 0.2, i + 0.3)
    }
    data[seq_len(nRow), , drop = FALSE]
}

to

f1 <- function (nRow) 
{
    incrSize <- 10000
    curSize <- min(10000, nRow)
    data <- as.environment(list(x = numeric(curSize), y = numeric(curSize), 
        z = numeric(curSize)))
    for (i in seq_len(nRow)) {
        if (i > curSize) {
            curSize <- min(curSize + incrSize, nRow)
            for (name in objects(data)) {
                length(data[[name]]) <- curSize
            }
        }
        data$x[i] <- i + 0.1
        data$y[i] <- i + 0.2
        data$z[i] <- i + 0.3
    }
    data.frame(as.list(data)) # use x=data$x, y=data$y, ... if order is important.
}

Here are some timing results for the above functions:
system.time(r1 <- f1(5000))
   user  system elapsed
   0.13    0.00    0.14
system.time(r1 <- f1(15000))
   user  system elapsed
   0.33    0.00    0.32
system.time(r1 <- f1(25000))
   user  system elapsed
   0.51    0.00    0.47
system.time(r0 <- f0(5000))
   user  system elapsed
   5.23    0.02    5.13
system.time(r0 <- f0(15000))
   user  system elapsed
  21.75    0.00   20.67
system.time(r0 <- f0(25000))
   user  system elapsed
  87.31    0.01   86.00
# results are same, except for the order of the columns
all.equal(r0[, c("x","y","z")], r1[, c("x","y","z")])
[1] TRUE
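
The comment at the end of f1 notes that as.list() on an environment does not promise any particular column order. A minimal sketch of one way to fix the order, assuming the column names are known at that point: mget() with a character vector returns the values in the order requested. The environment `cols` and its contents are made up for illustration.

```r
# Hypothetical environment holding three columns, in the style of f1.
cols <- as.environment(list(x = c(1.1, 2.1), y = c(1.2, 2.2), z = c(1.3, 2.3)))

# mget() returns a named list in the order of the names supplied,
# so the data.frame columns come out as x, y, z.
ordered <- data.frame(mget(c("x", "y", "z"), envir = cols))
names(ordered)
```

This avoids having to reorder the columns afterwards with r1[, c("x","y","z")].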

For 2 million rows f1 is getting a little superlinear: if time were linear in nRow it would take 2e6/25000 * .5 = 40 seconds, but I get 55 s.
system.time(r1 <- f1(2e6)) 
user  system elapsed 
  52.19    3.81   54.69
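
One plausible contributor to the superlinear growth, not measured here, is that f1 grows by a fixed incrSize, so the number of reallocations grows linearly with nRow. A sketch of a variant that doubles the capacity instead, as the original question proposed for data frames; the name f2 and the initial capacity of 16 are assumptions made for illustration.

```r
# Variant of f1 that doubles the vectors' capacity instead of adding
# a fixed increment. f2 and the starting capacity are illustrative.
f2 <- function(nRow) {
    curSize <- 16                       # initial capacity (arbitrary)
    data <- new.env()
    data$x <- numeric(curSize)
    data$y <- numeric(curSize)
    data$z <- numeric(curSize)
    for (i in seq_len(nRow)) {
        if (i > curSize) {
            curSize <- 2 * curSize      # doubling: O(log nRow) reallocations
            for (name in c("x", "y", "z")) {
                length(data[[name]]) <- curSize   # pads with NA
            }
        }
        data$x[i] <- i + 0.1
        data$y[i] <- i + 0.2
        data$z[i] <- i + 0.3
    }
    # Trim the unused capacity and fix the column order explicitly.
    keep <- seq_len(nRow)
    data.frame(x = data$x[keep], y = data$y[keep], z = data$z[keep])
}

r2 <- f2(1000)
```

Whether this removes the superlinearity at 2e6 rows would need to be checked with system.time() on the machine in question.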

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Peter Meilstrup
Sent: Tuesday, February 14, 2012 1:47 PM
To: r-help at r-project.org
Subject: [R] Filling out a data frame row by row.... slow!

I'm reading a file and using the file to populate a data frame. The way the
file is laid out, I need to fill in the data frame one row at a time.

When I start reading my file, I don't know how many rows I will need. It's
on the order of a million.

Being mindful of the time expense of reallocation, I decided on a strategy
of doubling the data frame size every time I needed to expand it ...
therefore memory is never more than 50% wasted, and it should still finish
in O(N) time.

But it still somehow has an O(N^2) performance characteristic. It seems
that just setting a single element is slower in a larger data frame than
in a smaller one. Here is a toy function that illustrates this by
reallocating and filling in single rows of a data frame, and shows the
slowdown:

populate.data.frame.test <- function(n=1000000, chunk=1000) {
  df <- data.frame(a=numeric(0), b=numeric(0), c=numeric(0))
  ## proc.time()[3] is elapsed time; [2] would be system CPU time
  t <- proc.time()[3]
  for (i in 1:n) {
    ## report the fill rate once per chunk
    if (i %% chunk == 0) {
      now <- proc.time()[3]
      elapsed <- now - t
      t <- now
      cat(sprintf("%d rows: %g rows per sec, nrows = %d\n", i,
                  chunk/elapsed, nrow(df)))
    }

    ##double data frame size if necessary
    while (nrow(df)<i) {
      df[max(i, 2*nrow(df)),] <- NA
      cat(sprintf("Doubled to %d rows\n", nrow(df)))
    }

    ##fill in one row
    df[i, c('a', 'b', 'c')] <- list(runif(1), i, runif(1))
  }
}

Is there a way to do this that avoids the slowdown? The data cannot be
represented as a matrix (different columns have different types.)

Peter

	[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
First, in R there is no need to declare the dimensions of your objects
before they are populated, so couldn't you save some run time by
skipping the step that doubles the data.frame?
df<- data.frame()
df
data frame with 0 columns and 0 rows
for(i in 1:100) for(j in 1:3) df[i,j]<- runif(1)
str(df)
'data.frame':   100 obs. of  3 variables:
 ...

Second, about populating an environment: ?assign might work better for you
e<- new.env()
system.time(for(i in 1:10000) e$a[i]<- rnorm(1,i))
user  system elapsed
   0.97    0.00    0.96
rm(e)
e<- new.env()
system.time(for(i in 1:10000) assign('a',rnorm(1,i),envir=e))
user  system elapsed
   0.17    0.00    0.17
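
Note that the two loops timed above do different work: e$a[i] <- ... builds a vector of length 10000, while assign('a', ...) overwrites a single value on every iteration, so the timings are not directly comparable. A small sketch that makes the difference visible; the names e1, e2 and n are illustrative.

```r
n <- 10000

# Indexed assignment: the vector a grows to length n.
e1 <- new.env()
for (i in seq_len(n)) e1$a[i] <- i

# assign(): a is replaced on every iteration, so only the last value survives.
e2 <- new.env()
for (i in seq_len(n)) assign("a", i, envir = e2)

c(length(e1$a), length(e2$a))
```

A like-for-like comparison would need the assign() loop to store all n values as well.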

Third, how are you reading in the file, and what does "not
knowing in advance..." mean? Bill's suggestion to not populate the
data.frame line by line is probably the "real" solution to your
problem, as otherwise it's a little like kicking a turtle to make it
go faster... try to find a rabbit instead.

Posting a minimal example of your file format would have really
helped. Often using ?scan to read the whole (or big chunks of the)
file into R, followed by a customized formatting function that
utilizes ?grep and ?strsplit to reconstruct the data you want in
columns, solves the NEED to populate a data frame line by line.
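
The read-then-reshape idea above can be sketched as follows. The file format is not given in this thread, so the input here is invented: one record per line, with fields written as name=value and separated by spaces. The helper parse_record and all field names are hypothetical.

```r
# Hypothetical input; in practice something like
#   lines <- readLines("myfile.txt")    # file name is a placeholder
lines <- c("id=1 score=0.5 label=a",
           "id=2 score=1.5 label=b",
           "id=3 score=2.5 label=c")

# Split every line into "name=value" tokens in one vectorised call.
fields <- strsplit(lines, " ", fixed = TRUE)

# Turn one record's tokens into a named character vector.
parse_record <- function(tokens) {
    kv <- strsplit(tokens, "=", fixed = TRUE)
    setNames(vapply(kv, `[`, character(1), 2),
             vapply(kv, `[`, character(1), 1))
}
records <- lapply(fields, parse_record)

# Column names come from the data; assumes every record has the same fields.
col_names <- names(records[[1]])
raw <- lapply(col_names, function(nm) vapply(records, `[[`, character(1), nm))
names(raw) <- col_names

# Convert each column to its natural type and build the data frame once.
df <- data.frame(lapply(raw, type.convert, as.is = TRUE),
                 stringsAsFactors = FALSE)
str(df)
```

The data frame is created once at the end, so no row-by-row assignment into it is needed.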

Hope this helps

Elai
One complication is I don't know the names of the columns I'm assigning to
before I read them off the file. And crazily, if I change this:
      data$x[i] <- i + 0.1

where data is an environment and x a primitive vector, to use a computed
name instead:

  data[[colname]][i] <- i + 0.1

Then I get back to way-superlinear performance. Eventually I found I could
work around it like:

eval(substitute(var[ix] <- data,
                list(var = as.name(colname), ix = i, data = i + 0.1)),
     envir = data)

but... as workarounds go that seems to be on the crazy nuts end of the
scale. Why does [[]] impose a performance penalty?
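
This does not explain the [[ ]] penalty, but one way to sidestep it when column names are only known at read time is to store each row as a list element and assemble the columns once at the end. A minimal sketch, assuming every row carries the same fields and, for simplicity, that the row count is known; read_rows and the field names a, b and label are stand-ins for the real file-reading code.

```r
# Stand-in for reading n records from a file, one row at a time.
read_rows <- function(n) {
    rows <- vector("list", n)             # one slot per row
    for (i in seq_len(n)) {
        # The field names appear only here, as they would when parsing a file.
        rows[[i]] <- list(a = i + 0.1, b = i, label = paste0("r", i))
    }
    rows
}

rows <- read_rows(1000)

# Discover the column names from the first row, then build each column
# in one pass over the rows.
col_names <- names(rows[[1]])
cols <- lapply(col_names, function(nm) unlist(lapply(rows, `[[`, nm)))
names(cols) <- col_names
result <- data.frame(cols, stringsAsFactors = FALSE)
```

Each column keeps its own type, so this also works when the data cannot be represented as a matrix.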

Peter
