Hi,
I have made a tiny package for saving dataframes in ASCII format. The
package contains functions save.table() and save.delim(), the first
mimics (not completely) write.table() and the second uses just
different default values, suitable for read.delim().
The reason I have written the functions is that I have had problems
with saving large dataframes in ASCII form. write.table() essentially
makes a huge string in memory from the dataframe. I am not sure about
write.matrix() (in MASS), but in my practice it is too
memory-intensive also. My approach was to write the whole thing in C
in such a way that the function takes values from the dataframe one
scalar at a time and writes them immediately to the file. This,
of course, puts certain limitations on the contents of dataframe and
output format.
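A rough R-level sketch of this streaming idea (the package itself does it in C, one scalar at a time; the function name and defaults below are purely illustrative, not the package's actual interface):

```r
# Illustrative sketch: write a dataframe block by block, so only one
# block of rows is ever formatted in memory at a time.
stream.table <- function(x, file, sep = "\t", blocksize = 500) {
  con <- file(file, open = "w")
  on.exit(close(con))
  n <- nrow(x)
  done <- 0
  while (done < n) {
    nb <- min(blocksize, n - done)
    block <- x[done + seq_len(nb), , drop = FALSE]
    # format and write just this block; it can then be garbage-collected
    write.table(block, con, sep = sep, quote = FALSE,
                row.names = FALSE, col.names = (done == 0))
    done <- done + nb
  }
}
```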
Here is an example of the result:

> dim(e2000)
[1] 7505 1197
> library(savetable)
> system.time(save.table(e2000, "e2000"))
[1] 38.04 0.48 48.75 0.00 0.00
> library(MASS)
> system.time(write.matrix(e2000, "e2000", sep=",", 1))

-- killed after 10 minutes swapping.

And now a smaller example:

> dim(e2000s)
[1] 100 1197
> library(savetable)
> system.time(save.table(e2000s, "e2000s"))
[1] 0.45 0.00 0.56 0.00 0.00
> system.time(write.table(e2000s, "e2000s"))
[1] 31.21 0.11 38.99 0.00 0.00
> library(MASS)
> system.time(write.matrix(e2000s, "e2000s", sep=",", 1))
[1] 4.01 0.66 5.45 0.00 0.00
None of the functions started swapping now, but as you can see,
save.table() is still around 10 times as fast as write.matrix().
Examples are on my 128MB PII-400 linux system and R 1.4.0.
I am not sure if there is much interest for such a package, so I put
it on my own website instead of CRAN
(http://www.obs.ee/~siim/savetable_0.1.0.tar.gz). Any feedback is
appreciated.
Many thanks to Brian Ripley and the others who helped me with
accessing R objects in C.
Best wishes,
Ott Toomet
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
?write.matrix will tell you what you have overlooked, a sensible
blocksize.
If `I am not sure about write.matrix()', surely reading the help page is a
first step?
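For instance (the matrix size and blocksize value here are just an illustrative choice, not a recommendation from the help page):

```r
library(MASS)   # write.matrix() lives in MASS

m <- matrix(rnorm(10000 * 50), 10000, 50)
# write a few hundred rows per pass instead of one row at a time
# (blocksize = 1) or the whole matrix in one go
write.matrix(m, file = tempfile(), sep = ",", blocksize = 500)
```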
On Sat, 10 Aug 2002, Ott Toomet wrote:
Hi,
I have made a tiny package for saving dataframes in ASCII format. The
package contains functions save.table() and save.delim(), the first
mimics (not completely) write.table() and the second uses just
different default values, suitable for read.delim().
The reason I have written the functions is that I have had problems
with saving large dataframes in ASCII form. write.table() essentially
makes a huge string in memory from the dataframe. I am not sure about
write.matrix() (in MASS), but in my practice it is too
memory-intensive also. My approach was to write the whole thing in C
in this way that the function takes the values from the dataframe, one
scalar value by time, and writes them immediately to the file. This,
of course, puts certain limitations on the contents of dataframe and
output format.
Here is an example of the result:

> dim(e2000)
[1] 7505 1197
> library(savetable)
> system.time(save.table(e2000, "e2000"))
[1] 38.04 0.48 48.75 0.00 0.00
> library(MASS)
> system.time(write.matrix(e2000, "e2000", sep=",", 1))

-- killed after 10 minutes swapping.

And now a smaller example:

> dim(e2000s)
[1] 100 1197
> library(savetable)
> system.time(save.table(e2000s, "e2000s"))
[1] 0.45 0.00 0.56 0.00 0.00
> system.time(write.table(e2000s, "e2000s"))
[1] 31.21 0.11 38.99 0.00 0.00
> library(MASS)
> system.time(write.matrix(e2000s, "e2000s", sep=",", 1))
[1] 4.01 0.66 5.45 0.00 0.00
None of the functions started swapping now, but as you can see,
save.table() is still around 10 times as fast as write.matrix().
Examples are on my 128MB PII-400 linux system and R 1.4.0.
I am not sure if there is much interest for such a package, so I put
it on my own website instead of CRAN
(http://www.obs.ee/~siim/savetable_0.1.0.tar.gz). Any feedback is
appreciated.
Many thanks to Brian Ripley and the others, who helped me accessing R
objects in C.
Best wishes,
Ott Toomet
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272860 (secr)
Oxford OX1 3TG, UK Fax: +44 1865 272595
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Hi,
I am continuing the discussion about dataframes in ASCII.
I have not overlooked the argument blocksize in write.matrix(), but
which is a sensible size? I assumed that blocksize=1 is the most
memory-efficient, but (for the smaller example) I experimented with
different sizes. Initially speed increased slightly, but it seemed to
become constant or even to decrease from around blocksize=5.
The problem for me is not the speed for small dataframes but the fact
that I was not able to save a large dataframe at all. I think the
reason is associated with the first line of write.matrix() which is
x <- as.matrix(x)
This converts the whole dataframe into a new ascii matrix, a process which
is both slow and memory-consuming if the original object is large. The
second place I am not sure about is the lines
cat(format(t(x[nlines + (1:nb), ])), file = file,
append = TRUE, sep = c(rep(sep, p - 1), "\n"))
Isn't t(x[...]) creating new temporary objects?
Or have I misunderstood something?
BTW, are there any ways to check memory consumption of individual
objects and functions?
best wishes,
Ott
On Sat, 10 Aug 2002 ripley at stats.ox.ac.uk wrote:
|?write.matrix will tell you what you have overlooked, a sensible
|blocksize.
|
|If `I am not sure about write.matrix()', surely reading the help page is a
|first step?
|
|On Sat, 10 Aug 2002, Ott Toomet wrote:
|
|> Hi,
|>
|> I have made a tiny package for saving dataframes in ASCII format. The
|> package contains functions save.table() and save.delim(), the first
|> mimics (not completely) write.table() and the second uses just
|> different default values, suitable for read.delim().
|>
|> The reason I have written the functions is that I have had problems
|> with saving large dataframes in ASCII form. write.table() essentially
|> makes a huge string in memory from the dataframe. I am not sure about
|> write.matrix() (in MASS), but in my practice it is too
|> memory-intensive also. My approach was to write the whole thing in C
|> in this way that the function takes the values from the dataframe, one
|> scalar value by time, and writes them immediately to the file. This,
|> of course, puts certain limitations on the contents of dataframe and
|> output format.
|>
|> Here is an example of the result:
|>
|> > dim(e2000)
|> [1] 7505 1197
|> > library(savetable)
|> > system.time(save.table(e2000, "e2000"))
|> [1] 38.04 0.48 48.75 0.00 0.00
|> > library(MASS)
|> > system.time(write.matrix(e2000, "e2000", sep=",", 1))
|>
|> -- killed after 10 minutes swapping.
|>
|> And now a smaller example:
|>
|> > dim(e2000s)
|> [1] 100 1197
|> > library(savetable)
|> > system.time(save.table(e2000s, "e2000s"))
|> [1] 0.45 0.00 0.56 0.00 0.00
|> > system.time(write.table(e2000s, "e2000s"))
|> [1] 31.21 0.11 38.99 0.00 0.00
|> > library(MASS)
|> > system.time(write.matrix(e2000s, "e2000s", sep=",", 1))
|> [1] 4.01 0.66 5.45 0.00 0.00
|>
|> None of the functions started swapping now, but as you can see,
|> save.table() is still around 10 times as fast as write.matrix().
|> Examples are on my 128MB PII-400 linux system and R 1.4.0.
|>
|> I am not sure if there is much interest for such a package, so I put
|> it on my own website instead of CRAN
|> (http://www.obs.ee/~siim/savetable_0.1.0.tar.gz). Any feedback is
|> appreciated.
|>
|> Many thanks to Brian Ripley and the others, who helped me accessing R
|> objects in C.
|>
|>
|> Best wishes,
|>
|> Ott Toomet
The sort of `large' here is 7500x1200. That's 72Mb if real numbers, so
let's assume you have at least 256Mb to use. I ran the following on
Windows with a 256Mb limit (and I had to use R-devel to do so). I actually
found it difficult to create a data frame of that size in 256Mb, and
resorted to
A1 <- vector("list", 1000)
for(i in 1:1000) A1[[i]] <- rnorm(8000)
class(A1) <- "data.frame"
row.names(A1) <- 1:8000
which took 15 secs and 140Mb as an underhand way to make a data frame.
(1.5.1 took too much memory here.)
Then
A2 <- as.matrix(A1)
took 1.8secs (hardly slow) and an additional 64Mb to hold the object A2.
I then deleted A1. Running
write.table(A2, "foo.dat", blocksize=1000)
used about 150Mb in about four minutes. That is formatting 8 million
numbers, and 85% of the time was spent in the system calls, as one should
expect. (I suspect I did not need to delete A1, but didn't want to wait
around to find out.)
So
1) you could have checked your claims by some simple experiments.
2) as claimed, write.matrix does indeed do the job.
On Sun, 11 Aug 2002, Ott Toomet wrote:
I am Continuing discussion about dataframes in ASCII.
I have not overlooked the argument blocksize in write.matrix(), but
which is a sensible size? I assumed the blocksize=1 is the most
memory-efficient, but (for smaller example) I experimented with
different sizes. Initially, speed increased slightly, but seemed to
be constant or even decreasing from around value 5.
A few hundred, probably.
Why did you assume that blocksize=1 was best? R is a vector language, and
it is normally best to use the largest blocks that you can fit in memory.
The problem for me is not the speed for small dataframes but the fact
that I was not able to save a large dataframe at all. I think the
reason is associated with the first line of write.matrix() which is
x <- as.matrix(x)
This converts the whole dataframe into a new ascii matrix, a process which
Not if it is a matrix: what's the function name? For a general data frame
there really is no choice but to convert each column as a whole.
is both slow and memory consuming if the original object is large.
False: see above.
The
second place I am not sure about are lines
cat(format(t(x[nlines + (1:nb), ])), file = file,
append = TRUE, sep = c(rep(sep, p - 1), "\n"))
isn't t(x[...]) creating new temporary objects?
Yes (and so is the format call), but there is garbage collection. That's
one reason why a blocksize of 1 is not at all sensible, forcing the loop
to be run thousands of times. Just choose blocksize to keep this step in
your memory bounds.
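A back-of-the-envelope way to pick such a blocksize might look like this (all the constants below are illustrative guesses, not measurements):

```r
# Keep one formatted block (plus the copies made by t(), format()
# and cat()) inside a memory budget.
p <- 1200                  # columns, as in the dataframe discussed above
budget <- 10 * 1024^2      # allow ~10Mb of working space for the block
bytes.per.cell <- 20       # rough size of one formatted value
copies <- 4                # t(), format() and cat() each duplicate it
blocksize <- floor(budget / (bytes.per.cell * p * copies))
blocksize                  # on the order of a hundred rows per block
```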
Or have I misunderstood something?
Your memory size? I suggest buying another 512Mb/1Gb of RAM.
Hi,
True, write.matrix does quite a good job if the data is already in
matrix form. The problem arises with real data (a labour force survey
in my case), which includes variables of different storage modes. The
dataframe I used contains mostly integers and factors in character form
(most of the dataframe consists of NAs, however).
My computer has 128M memory, R (1.5.1) took 52MB when dataframe e2000 was
loaded (7500x1200). Trying to transform it to a matrix
f2000 <- as.matrix(e2000)
R grew to 155MB, after which I killed the process. So in this case the
blocksize does not help much.
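The coercion behind this can be seen on a small dataframe (the column names below are made up for illustration):

```r
# as.matrix() on a dataframe of mixed storage modes must coerce
# everything to one common mode -- character -- so the intermediate
# matrix can dwarf the original dataframe
d <- data.frame(id = 1:3,
                wage = c(1.5, NA, 2.5),
                sector = c("a", "b", "a"))
m <- as.matrix(d)
mode(m)        # "character" -- every cell is now a string
```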
Best wishes,
Ott
On Sun, 11 Aug 2002 ripley at stats.ox.ac.uk wrote:
|The sort of `large' here is 7500x1200. That's 72Mb if real numbers, so
|let's assume you have at least 256Mb to use. I ran the following on
|Windows with a 256Mb limit (and I had to use R-devel to do so). I actually
|found it difficult to create a data frame of that size in 256Mb, and
|resorted to
|
|A1 <- vector("list", 1000)
|for(i in 1:1000) A1[[i]] <- rnorm(8000)
|class(A1) <- "data.frame"
|row.names(A1) <- 1:8000
|
|which took 15 secs and 140Mb as an underhand way to make a data frame.
|(1.5.1 took too much memory here.)
|
|Then
|
|A2 <- as.matrix(A1)
|
|took 1.8secs (hardly slow) and an additional 64Mb to hold the object A2.
|I then deleted A1. Running
|
|write.table(A2, "foo.dat", blocksize=1000)
|
You mean write.matrix?
|used about 150Mb in about four minutes. That is formatting 8 million
|numbers, and 85% of the time was spent in the system calls, as one should
|expect. (I suspect I did not need to delete A1, but didn't want to wait
|around to find out.)
|
|So
|
|1) you could have checked your claims by some simple experiments.
|
|2) as claimed, write.matrix does indeed do the job.
Agreed, given that there is sufficient memory and/or the data is of
homogeneous storage mode.
On Sun, Aug 11, 2002 at 02:51:33PM +0200, Ott Toomet wrote:
BTW, are there any ways to check memory consumption of individual
objects and functions?
Would 'object.size' correspond to your needs?
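For example (the sizes below are rough, since object.size() counts only one object and not any temporaries a function creates):

```r
# object.size() measures one object; gc() reports (and triggers)
# a garbage collection, showing total memory in use by the session
x <- rnorm(1e6)
print(object.size(x))   # about 8Mb: one million 8-byte doubles
rm(x)
gc()                    # usage after x has been freed
```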
best wishes,
Ott