write.table with row.names=FALSE unnecessarily slow?

4 messages · Martin Morgan, Brian Ripley, Martin Maechler

#
write.table with large data frames takes quite a long time:
> system.time({
+     write.table(df, '/tmp/dftest.txt', row.names=FALSE)
+ }, gcFirst=TRUE)
   user  system elapsed 
 97.302   1.532  98.837 

One reason is that dimnames() is always called, which causes 'anonymous'
row names to be created as character vectors even when row.names=FALSE
means they are never written. Avoiding this in src/library/utils, along
the lines of

Index: write.table.R
===================================================================
--- write.table.R	(revision 44717)
+++ write.table.R	(working copy)
@@ -27,13 +27,18 @@
 
     if(!is.data.frame(x) && !is.matrix(x)) x <- data.frame(x)
 
+    makeRownames <- is.logical(row.names) && !is.na(row.names) &&
+                    row.names==TRUE
+    makeColnames <- is.logical(col.names) && !is.na(col.names) &&
+                    col.names==TRUE
     if(is.matrix(x)) {
         ## fix up dimnames as as.data.frame would
         p <- ncol(x)
         d <- dimnames(x)
         if(is.null(d)) d <- list(NULL, NULL)
-        if(is.null(d[[1]])) d[[1]] <- seq_len(nrow(x))
-        if(is.null(d[[2]]) && p > 0) d[[2]] <-  paste("V", 1:p, sep="")
+        if (is.null(d[[1]]) && makeRownames) d[[1]] <- seq_len(nrow(x))
+        if(is.null(d[[2]]) && p > 0 && makeColnames)
+            d[[2]] <-  paste("V", 1:p, sep="")
         if(is.logical(quote) && quote)
             quote <- if(is.character(x)) seq_len(p) else numeric(0)
     } else {
@@ -53,8 +58,8 @@
                 quote <- ord[quote]; quote <- quote[quote > 0]
             }
         }
-        d <- dimnames(x)
-        if(is.null(d[[1]])) d[[1]] <- seq_len(nrow(x))
+        d <- list(if (makeRownames==TRUE) row.names(x) else NULL,
+                  if (makeColnames==TRUE) names(x) else NULL)
         p <- ncol(x)
     }
     nocols <- p==0

improves performance, with savings that grow at least in proportion to
nrow(x):
> system.time({
+     write.table(df, '/tmp/dftest1.txt', row.names=FALSE)
+ }, gcFirst=TRUE)
   user  system elapsed 
  8.132   0.608   8.899 
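
For readers who want to see the effect on a small scale, here is a minimal sketch. The data frame `df_demo`, its size `n`, and the temporary file are hypothetical stand-ins (the original test case is not shown above), and timings will vary by machine and R version.

```r
# Hypothetical stand-in for the test case; n is kept small so this runs quickly.
n <- 1e5
df_demo <- data.frame(x = seq_len(n))

# Automatic row names are stored compactly; .row_names_info() reports this
# with a negative row count.
compact <- .row_names_info(df_demo) < 0

# dimnames() has to hand back real names, so it builds a character vector
# of length n even if the caller never uses the row names.
rn <- dimnames(df_demo)[[1]]
materialized <- is.character(rn) && length(rn) == n

# Time a write that does not need row names at all.
out <- tempfile(fileext = ".txt")
timing <- system.time(
    write.table(df_demo, out, row.names = FALSE),
    gcFirst = TRUE
)
print(timing)

n_lines <- length(readLines(out))   # header line plus n data lines
unlink(out)
```

On a patched or recent R the elapsed time should no longer include the cost of building `n` row-name strings, though how much that matters depends on the shape of the data.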

Martin
#
I neglected to include my test case,
Martin

Martin Morgan <mtmorgan at fhcrc.org> writes:

  
    
#
This is a pretty extreme case: why not use write() to write a single 
column?  (It's a bit faster than your patched timing.)
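
The write() suggestion can be sketched as follows for a single integer column. The vector `x_demo` and the temporary files are hypothetical; the two outputs are compared only for this integer example, since doubles may be formatted differently by the two functions.

```r
x_demo <- 1:10
f_write <- tempfile()
f_table <- tempfile()

# write() is a thin wrapper around cat(); ncolumns = 1 gives one value per line.
write(x_demo, f_write, ncolumns = 1)

# The write.table() equivalent for a bare column: no row names, no header.
write.table(data.frame(x = x_demo), f_table,
            row.names = FALSE, col.names = FALSE)

# Compare the two files line by line, then clean up.
same_output <- identical(readLines(f_write), readLines(f_table))
unlink(c(f_write, f_table))
```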

In a more realistic test of 10 columns of 1 million rows I see a speedup 
from 12.2 to 9.7 seconds.

So I'll add the patch, but think that significant speedups will be quite 
rare.

BTW, this seems to be one of the places where we are paying the price of 
the CHARSXP cache: system.time(as.character(1:1e7)) has got a lot slower.
Maybe some further tuning is called for.
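
A rough way to look at that conversion cost in isolation is sketched below. The size `n_demo` is a hypothetical, smaller value than the 1e7 quoted above, and the absolute timing depends heavily on the R version and machine.

```r
# n_demo is a hypothetical, smaller size than the 1e7 quoted above.
n_demo <- 1e6
timing <- system.time(chr <- as.character(seq_len(n_demo)))
print(timing)

# Every element is a distinct string, so the conversion cannot reuse
# previously created strings and has to create n_demo new ones.
all_distinct <- !anyDuplicated(chr)
```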
On Mon, 10 Mar 2008, Martin Morgan wrote:

            

  
    
#
    MartinMo> write.table with large data frames takes quite a long time
    MartinMo> system.time({
    MartinMo> +     write.table(df, '/tmp/dftest.txt', row.names=FALSE)
    MartinMo> + }, gcFirst=TRUE)
    MartinMo> user  system elapsed 
    MartinMo> 97.302   1.532  98.837 

    MartinMo> A reason is because dimnames is always called, causing 'anonymous' row
    MartinMo> names to be created as character vectors. Avoiding this in
    MartinMo> src/library/utils, along the lines of

Thank you, Martin.

Note that we needed to fix your patch 
(for the case where the data frame has a 'matrix column'),

and I'd like to further remark that I consider
 '.... == TRUE '
to be quite ugly (or inefficient) in all circumstances.
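
One way to avoid the explicit comparison is isTRUE(), sketched below. This is only an illustration and not necessarily the form that was committed; `make_flag_patch` and `make_flag_istrue` are hypothetical helper names.

```r
# The test as written in the patch, wrapped in a hypothetical helper.
make_flag_patch <- function(arg)
    is.logical(arg) && !is.na(arg) && arg == TRUE

# A shorter alternative using isTRUE().
make_flag_istrue <- function(arg) isTRUE(arg)

# Scalar inputs that row.names / col.names might receive.
inputs <- list(TRUE, FALSE, NA, "id")
agree <- vapply(inputs,
                function(a) identical(make_flag_patch(a), make_flag_istrue(a)),
                logical(1))
print(agree)
```

The comparison is restricted to length-one inputs; vectors of names would need separate consideration.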

Martin Maechler, ETH Zurich


