Filling out a data frame row by row.... slow!

5 messages · Peter Meilstrup, William Dunlap, ilai

#
If you must repeatedly append rows to a data.frame,
try making the dataset you are filling in a bunch
of independent vectors, perhaps in a new environment
to keep things organized, and expand each at the same time.
At the very end make a data.frame out of those vectors.
E.g., change the likes of

f0 <- function (nRow) 
{
    incrSize <- 10000
    curSize <- 10000
    data <- data.frame(x = numeric(curSize), y = numeric(curSize), 
        z = numeric(curSize))
    for (i in seq_len(nRow)) {
        if (i > curSize) {
            data <- rbind(data, data.frame(x = numeric(incrSize), 
                y = numeric(incrSize), z = numeric(incrSize)))
            curSize <- nrow(data)
        }
        data[i, ] <- c(i + 0.1, i + 0.2, i + 0.3)
    }
    data[seq_len(nRow), , drop = FALSE]
}

to

f1 <- function (nRow) 
{
    incrSize <- 10000
    curSize <- min(10000, nRow)
    data <- as.environment(list(x = numeric(curSize), y = numeric(curSize), 
        z = numeric(curSize)))
    for (i in seq_len(nRow)) {
        if (i > curSize) {
            curSize <- min(curSize + incrSize, nRow)
            for (name in objects(data)) {
                length(data[[name]]) <- curSize
            }
        }
        data$x[i] <- i + 0.1
        data$y[i] <- i + 0.2
        data$z[i] <- i + 0.3
    }
    data.frame(as.list(data)) # use x=data$x, y=data$y, ... if order is important.
}

Here are some timing results for the above functions:
   user  system elapsed 
   0.13    0.00    0.14
   user  system elapsed 
   0.33    0.00    0.32
   user  system elapsed 
   0.51    0.00    0.47
   user  system elapsed 
   5.23    0.02    5.13
   user  system elapsed 
  21.75    0.00   20.67
   user  system elapsed 
  87.31    0.01   86.00
[1] TRUE

For 2 million rows f1 is getting a little superlinear: if time were linear in nRow
I would expect 2e6/25000 * .5 = 40 seconds, but I get 55 s.
   user  system elapsed 
  52.19    3.81   54.69
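
The commands behind the timings above are not shown in this archive. As a
rough sketch of how such a comparison could be run, the following times a
row-by-row fill of a data.frame against filling plain vectors and building
the data.frame once at the end. The helpers fill_df and fill_vecs are
hypothetical, simplified stand-ins (they preallocate to the full size and
skip the incremental growth that f0 and f1 do), and n is kept small so it
finishes quickly; the actual numbers will vary by machine and R version.

```r
# Hypothetical helper: fill a preallocated data.frame one row at a time.
fill_df <- function(nRow) {
    data <- data.frame(x = numeric(nRow), y = numeric(nRow), z = numeric(nRow))
    for (i in seq_len(nRow)) {
        data[i, ] <- c(i + 0.1, i + 0.2, i + 0.3)
    }
    data
}

# Hypothetical helper: fill plain vectors, then build the data.frame once.
fill_vecs <- function(nRow) {
    x <- numeric(nRow)
    y <- numeric(nRow)
    z <- numeric(nRow)
    for (i in seq_len(nRow)) {
        x[i] <- i + 0.1
        y[i] <- i + 0.2
        z[i] <- i + 0.3
    }
    data.frame(x = x, y = y, z = z)
}

n <- 2000  # small on purpose; increase to see the gap widen
t_df <- system.time(res_df <- fill_df(n))
t_vecs <- system.time(res_vecs <- fill_vecs(n))
print(t_df)
print(t_vecs)

# Check that both approaches produce the same data.
same <- isTRUE(all.equal(res_df, res_vecs))
print(same)
```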

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
First, in R there is no need to declare the dimensions of your objects
before they are populated, so couldn't you reduce some run time by not
going through the double data.frame step?
data frame with 0 columns and 0 rows
'data.frame':   100 obs. of  3 variables:
 ...

Second, about populating an environment: ?assign might work better for you.
   user  system elapsed
   0.97    0.00    0.96
   user  system elapsed
   0.17    0.00    0.17
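
The code that was timed for this second point is not shown in this archive,
so what follows is only a guess at the general idea: a minimal sketch of
populating an environment with assign() and collecting everything into a
data.frame once at the end. The object names ("row1", "row2", ...) and the
variables env, rows and result are made up for illustration.

```r
env <- new.env()
n <- 1000

# Store each record in the environment under its own (hypothetical) name.
for (i in seq_len(n)) {
    assign(paste0("row", i), c(i + 0.1, i + 0.2, i + 0.3), envir = env)
}

# Fetch the records back in a known order and bind them in one step.
rows <- mget(paste0("row", seq_len(n)), envir = env)
result <- as.data.frame(do.call(rbind, rows))
names(result) <- c("x", "y", "z")

str(result)
```

mget() is given the names explicitly because listing an environment with
objects() returns them sorted alphabetically, not in insertion order.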

Third, how are you reading in the file? And what does "not
knowing in advance..." mean? Bill's suggestion to not populate the
data.frame line by line is probably the "real" solution to your
problem, as otherwise it's a little like kicking a turtle to make it
go faster... try to find a rabbit instead.

Posting a minimal example of your file format would have really
helped. Often using ?scan to read the whole (or big chunks of the)
file into R, followed by a customized formatting function that
utilizes ?grep and ?strsplit to reconstruct the data you want in
columns, solves the NEED to populate a data frame line by line.
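
Since the actual file format was never posted, here is only a sketch of the
scan / grep / strsplit approach on an invented format: one record per line,
three numeric fields separated by ";", with comment lines starting with "#".
The sample lines, the temporary file and all variable names are assumptions.

```r
# Invented sample input, written to a temporary file for the example.
lines_in <- c("# a comment line", "1.1;1.2;1.3", "2.1;2.2;2.3", "3.1;3.2;3.3")
path <- tempfile()
writeLines(lines_in, path)

# Read the whole file at once, one character string per line.
raw <- scan(path, what = character(), sep = "\n", quiet = TRUE)

# Drop comment lines, then split the remaining lines into fields.
keep <- grep("^#", raw, invert = TRUE, value = TRUE)
fields <- strsplit(keep, ";", fixed = TRUE)

# Assemble all records in one step instead of appending row by row.
mat <- matrix(as.numeric(unlist(fields)), ncol = 3, byrow = TRUE)
parsed <- data.frame(x = mat[, 1], y = mat[, 2], z = mat[, 3])

print(parsed)
unlink(path)
```

This assumes every record has exactly three fields; a real file would need
a check on the field counts (e.g. with lengths(fields)) before building the
matrix.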

Hope this helps

Elai