write.table with row.names=FALSE unnecessarily slow?
This is a pretty extreme case: why not use write() to write a single column? (It's a bit faster than your patched timing.) In a more realistic test of 10 columns of 1 million rows I see a speedup from 12.2 to 9.7 seconds. So I'll add the patch, but think that significant speedups will be quite rare.

BTW, this seems to be one of the places where we are paying the price of the CHARSXP cache: system.time(as.character(1:1e7)) has got a lot slower. Maybe some further tuning is called for.
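For readers who want to try the single-column comparison, here is a minimal sketch. It is not the code behind the timings quoted in this thread: the row count is reduced so it runs quickly, and the file names come from tempfile() rather than /tmp. col.names=FALSE is passed to write.table so that both files hold only the data values.

```r
# Illustrative sizes only; the thread used 10^7 rows.
n <- 1e5
df <- data.frame(x = seq_len(n))

f_table <- tempfile(fileext = ".txt")
f_write <- tempfile(fileext = ".txt")

# write.table on a one-column data frame, without row or column names
t_table <- system.time(
    write.table(df, f_table, row.names = FALSE, col.names = FALSE)
)

# write() on the bare vector, one value per line
t_write <- system.time(write(df$x, f_write, ncolumns = 1))

# Compare the two files line by line, then show the timings
same <- identical(readLines(f_table), readLines(f_write))
print(rbind(write.table = t_table, write = t_write)[, 1:3])
```

Timings will vary with the R version and the machine, so the numbers are best read as relative rather than absolute.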
On Mon, 10 Mar 2008, Martin Morgan wrote:
I neglected to include my test case,
df <- data.frame(x=1:(10^7))
Martin

Martin Morgan <mtmorgan at fhcrc.org> writes:
write.table with large data frames takes quite a long time
system.time({
+ write.table(df, '/tmp/dftest.txt', row.names=FALSE)
+ }, gcFirst=TRUE)
user system elapsed
97.302 1.532 98.837
One reason is that dimnames is always called, causing 'anonymous' row
names to be created as character vectors. Avoiding this in
src/library/utils, along the lines of
Index: write.table.R
===================================================================
--- write.table.R (revision 44717)
+++ write.table.R (working copy)
@@ -27,13 +27,18 @@
if(!is.data.frame(x) && !is.matrix(x)) x <- data.frame(x)
+ makeRownames <- is.logical(row.names) && !is.na(row.names) &&
+ row.names==TRUE
+ makeColnames <- is.logical(col.names) && !is.na(col.names) &&
+ col.names==TRUE
if(is.matrix(x)) {
## fix up dimnames as as.data.frame would
p <- ncol(x)
d <- dimnames(x)
if(is.null(d)) d <- list(NULL, NULL)
- if(is.null(d[[1]])) d[[1]] <- seq_len(nrow(x))
- if(is.null(d[[2]]) && p > 0) d[[2]] <- paste("V", 1:p, sep="")
+ if (is.null(d[[1]]) && makeRownames) d[[1]] <- seq_len(nrow(x))
+ if(is.null(d[[2]]) && p > 0 && makeColnames)
+ d[[2]] <- paste("V", 1:p, sep="")
if(is.logical(quote) && quote)
quote <- if(is.character(x)) seq_len(p) else numeric(0)
} else {
@@ -53,8 +58,8 @@
quote <- ord[quote]; quote <- quote[quote > 0]
}
}
- d <- dimnames(x)
- if(is.null(d[[1]])) d[[1]] <- seq_len(nrow(x))
+ d <- list(if (makeRownames==TRUE) row.names(x) else NULL,
+ if (makeColnames==TRUE) names(x) else NULL)
p <- ncol(x)
}
nocols <- p==0
improves performance at least in proportion to nrow(x):
system.time({
+ write.table(df, '/tmp/dftest1.txt', row.names=FALSE)
+ }, gcFirst=TRUE)
user system elapsed
8.132 0.608 8.899

Martin
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M2 B169
Phone: (206) 667-2793
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
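The point about 'anonymous' row names can be checked on a small data frame. The sketch below only illustrates the behaviour described in the message above and is not part of the patch: a data frame created without explicit row names stores them compactly, while dimnames() expands them into a character vector with one string per row.

```r
df <- data.frame(x = 1:10)

# A negative value here indicates automatic ("anonymous") row names,
# which are stored compactly rather than as one string per row.
info <- .row_names_info(df)

# dimnames() on a data frame returns the row names as a character vector,
# so calling it creates nrow(df) strings even if they are never written.
d <- dimnames(df)
str(d)

# Asking only for the names that are needed avoids that allocation,
# which is roughly what the patch does when row.names=FALSE.
d_cols_only <- list(NULL, names(df))
```

With 10^7 rows, that character vector is what the unpatched write.table builds and then discards when row.names=FALSE.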
--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595