Subsetting a data frame vs. subsetting the columns
On Wed, Dec 28, 2011 at 8:14 AM, Simon Urbanek
<simon.urbanek at r-project.org> wrote:
Hadley, there was a whole discussion about subsetting and subassigning data frames (and general efficiency issues) some time ago (I can't find it in a hurry but others might)
Yep, a rather lengthy discussion at that: http://r.789695.n4.nabble.com/speeding-up-perception-td3640920.html
IIRC, there was also some off-list stuff about what it would take to push to C, which I may have in my inbox if anyone wants.
Cheers,
Josh
-- just look at the `[.data.frame` code to see why it's so slow. It would need to be pushed into C code to allow certain optimizations, but it's quite complex code, so I don't think there were volunteers. So the advice is don't do it ;). Treating DFs as lists is always faster since you get to the fast internal code.
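As a rough illustration of the list-based route (an editorial sketch, not code posted in this thread): a data frame is a list of columns, so reordering each column with plain vector indexing never calls `[.data.frame`. The small sizes and the `set.seed` call are assumptions chosen only so the check runs quickly.

```r
# Illustrative sketch: reorder rows via the data frame method vs. column by column.
set.seed(1)
df <- as.data.frame(replicate(3, runif(10)))
ord <- order(df[[1]])

# Whole-data-frame route: dispatches to `[.data.frame`.
by_frame <- df[ord, ]

# List route: lapply() walks the columns, and each x[ord] is an ordinary
# vector subset handled by internal code.
by_list <- lapply(df, function(x) x[ord])

# Both routes yield the same values in every column.
same_values <- all(mapply(identical, by_frame, by_list))
stopifnot(same_values)
```

The list route returns a plain list, so it only helps when the caller does not need data frame behaviour from the result straight away.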
Cheers,
S
On Dec 28, 2011, at 10:37 AM, Hadley Wickham wrote:
Hi all,
There seems to be rather a large speed disparity in subsetting when
working with a whole data frame vs. working with just columns
individually:
df <- as.data.frame(replicate(10, runif(1e5)))
ord <- order(df[[1]])
system.time(df[ord, ])
#    user  system elapsed
#   0.043   0.007   0.059
system.time(lapply(df, function(x) x[ord]))
#    user  system elapsed
#   0.022   0.008   0.029
What's going on?
I realise this isn't quite a fair example because the second case
makes a list not a data frame, but I thought it would be a quick
operation to turn a list into a data frame if you don't do any
checking:
list_to_df <- function(list) {
  n <- length(list[[1]])
  structure(list,
    class = "data.frame",
    row.names = c(NA, -n))
}
system.time(list_to_df(lapply(df, function(x) x[ord])))
#    user  system elapsed
#   0.031   0.017   0.048
So I guess this is slow because it has to make a copy of the whole
data frame to modify the structure. But couldn't `[.data.frame` avoid
that?
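One difference worth checking before swapping the shortcut in for `df[ord, ]` is row names. The sketch below is an editorial addition, not part of the original message. It restates `list_to_df` so it runs on its own, and uses a small data frame as an assumption for speed. It suggests the two results agree once row names are reset.

```r
# Hadley's helper from above, restated so this sketch is self-contained.
list_to_df <- function(list) {
  n <- length(list[[1]])
  structure(list, class = "data.frame", row.names = c(NA, -n))
}

set.seed(1)
df <- as.data.frame(replicate(3, runif(10)))
ord <- order(df[[1]])

fast <- list_to_df(lapply(df, function(x) x[ord]))
slow <- df[ord, ]

# `[.data.frame` carries the original row names along (here the values of ord),
# while list_to_df() always numbers the rows 1..n.
stopifnot(identical(rownames(slow), as.character(ord)))
stopifnot(identical(rownames(fast), as.character(seq_len(nrow(df)))))

# After dropping the carried-over row names, the two results match.
rownames(slow) <- NULL
stopifnot(isTRUE(all.equal(fast, slow)))
```

Because `list_to_df` does no checking, it assumes every column has the same length as the first. That assumption is safe here only because all columns are indexed with the same `ord`.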
Hadley
--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/