Subsetting a data frame vs. subsetting the columns

4 messages · Simon Urbanek, Joshua Wiley, Hadley Wickham

#
Hi all,

There seems to be rather a large speed disparity in subsetting when
working with a whole data frame vs. working with just columns
individually:

df <- as.data.frame(replicate(10, runif(1e5)))
ord <- order(df[[1]])

system.time(df[ord, ])
#   user  system elapsed
#  0.043   0.007   0.059
system.time(lapply(df, function(x) x[ord]))
#   user  system elapsed
#  0.022   0.008   0.029

What's going on?

I realise this isn't quite a fair example because the second case
makes a list, not a data frame, but I thought it would be a quick
operation to turn a list into a data frame if you don't do any
checking:

list_to_df <- function(list) {
  n <- length(list[[1]])
  structure(list,
    class = "data.frame",
    row.names = c(NA, -n))
}
system.time(list_to_df(lapply(df, function(x) x[ord])))
#    user  system elapsed
#  0.031   0.017   0.048

So I guess this is slow because it has to make a copy of the whole
data frame to modify the structure.  But couldn't [.data.frame avoid
that?
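
As a sanity check on the comparison above, here is a minimal sketch (using a smaller data frame than in the timings, and with set.seed() added only for reproducibility) that confirms the two approaches produce the same column contents. The one difference it illustrates is the row names: `df[ord, ]` keeps the original row labels, while list_to_df() assigns fresh automatic ones.

```r
set.seed(1)  # assumption: seeding added here only to make the sketch repeatable
df <- as.data.frame(replicate(10, runif(1e3)))
ord <- order(df[[1]])

# Same unchecked constructor as in the message above
list_to_df <- function(list) {
  n <- length(list[[1]])
  structure(list,
    class = "data.frame",
    row.names = c(NA, -n))
}

slow <- df[ord, ]                                   # goes through `[.data.frame`
fast <- list_to_df(lapply(df, function(x) x[ord]))  # column-by-column subsetting

# as.list() drops the row names, so this compares column contents only
same_values <- identical(as.list(slow), as.list(fast))

# Row names differ: the original labels are reordered in `slow`,
# while `fast` is renumbered from 1
head(rownames(slow))
head(rownames(fast))
```

So the list-based route is only a drop-in replacement when the original row names do not matter.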

Hadley
#
Hadley,

there was a whole discussion about subsetting and subassigning data frames (and general efficiency issues) some time ago (I can't find it in a hurry, but others might). Just look at the `[.data.frame` code to see why it's so slow. It would need to be pushed into C code to allow certain optimizations, but it's quite complex code, so I don't think there were any volunteers. So the advice is don't do it ;). Treating DFs as lists is always faster since you get to the fast internal code.
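
A rough sketch of what "treating DFs as lists" could look like for row subsetting is below. The helper name `subset_rows_as_list` is hypothetical (it is not part of base R), and it assumes every column is a plain vector or factor (no matrix or data frame columns) and that `i` holds positive integer row positions. It also discards the original row names, so it is not a general replacement for `[.data.frame`.

```r
# Hypothetical helper, not part of base R.
# Assumptions: columns are plain vectors or factors, and `i` is a vector of
# positive integer row positions. Original row names are discarded.
subset_rows_as_list <- function(df, i) {
  # unclass() exposes the underlying list of columns, so each x[i] uses the
  # ordinary vector subsetting code rather than `[.data.frame`
  cols <- lapply(unclass(df), function(x) x[i])
  structure(cols,
    names = names(df),
    class = "data.frame",
    row.names = c(NA, -length(i)))  # compact automatic row names
}

# Usage: sort a small data frame by one of its columns
df <- data.frame(id = 1:5, score = c(0.3, 0.9, 0.1, 0.7, 0.5))
ord <- order(df$score)
sorted <- subset_rows_as_list(df, ord)
```

Skipping the checks that `[.data.frame` performs is what makes this cheap, and also what makes it unsafe for inputs outside the stated assumptions.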

Cheers,
S
On Dec 28, 2011, at 10:37 AM, Hadley Wickham wrote:

            
#
On Wed, Dec 28, 2011 at 8:14 AM, Simon Urbanek
<simon.urbanek at r-project.org> wrote:
Yep, there was a rather lengthy discussion at
http://r.789695.n4.nabble.com/speeding-up-perception-td3640920.html.
IIRC, there was also some off-list discussion about what it would take to
push it to C, which I may have in my inbox if anyone wants it.

Cheers,

Josh

#
Ah, thanks for the pointers!
Hadley

On Wed, Dec 28, 2011 at 10:14 AM, Simon Urbanek
<simon.urbanek at r-project.org> wrote: