"DougB" == Douglas Bates <bates@stat.wisc.edu> writes:
DougB> At present the example data sets in R libraries are to be given as
DougB> expressions that can be read directly into R. For example, the acid.R
DougB> file in the main library looks like
DougB> acid <- data.frame(
DougB>     carb   = c(0.1, 0.3, 0.5, 0.6, 0.7, 0.9),
DougB>     optden = c(0.086, 0.269, 0.446, 0.538, 0.626, 0.782),
DougB>     row.names = paste(1:6))
DougB> This is great when you have only a few observations. I have one
DougB> example data set with over 9000 rows and 17 variables.  Even when I
DougB> start R with -v 40, I exhaust the available memory trying to read it
DougB> in as a data.frame.  I believe this is because of the recursive
DougB> nature of the parsing of data objects.
yes;
DougB> Are there alternatives that would cause less memory usage?
yes; but only in the 0.62 development version.
The current 0.62 ``standard'' is:
  - if a 'data' file ends in .R,  source(.) is used to read it;
  - if it ends in .tab,           read.table(..., header = TRUE) is used.
(You find the new data(.) function in src/library/base/data in R-snapshot.)
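In other words, data(.) simply dispatches on the file extension.  A
minimal sketch of that convention, written in current R with a made-up
helper name (this is NOT the actual data() source, just an illustration):

    ## Hypothetical illustration of the extension-based dispatch:
    readDataFile <- function(path, envir = .GlobalEnv) {
        if (grepl("\\.R$", path)) {
            ## a .R file carries parsed R expressions; the assignments
            ## happen inside the file itself
            sys.source(path, envir = envir)
        } else if (grepl("\\.tab$", path)) {
            ## a .tab file is tabular text; name the object after the file
            name <- sub("\\.tab$", "", basename(path))
            assign(name, read.table(path, header = TRUE), envir = envir)
        } else stop("unrecognized data file: ", path)
    }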
Note that this is still not really satisfactory for large data files,
since read.table(.) is not very efficient:
it first reads everything into a character matrix and then converts
variable by variable, some columns to numeric, some to factor.
On the other hand: does it really make sense to distribute huge example
data sets as yours above?
If yes, AND if you have only numeric data,
I'd propose the following:
1) create a <pkg>/data/dougBex.R file which only contains something
   like the following (a filled-in instance follows after the list):

       dougBex <- as.data.frame(
           matrix(scan(system.file("<pkg>/data/dougBex.dat")),
                  ncol = ..., byrow = TRUE,  # scan() reads row by row
                  dimnames = ...))

2) create <pkg>/data/dougBex.dat to contain all your data, white-space
   delimited, numeric.
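For concreteness, with a 9000-row, 17-variable numeric data set such as
yours, the .R file could look like this (the column names here are
hypothetical):

    ## Hypothetical filled-in version for 17 numeric columns:
    dougBex <- as.data.frame(
        matrix(scan(system.file("<pkg>/data/dougBex.dat")),
               ncol = 17, byrow = TRUE,
               dimnames = list(NULL, paste("v", 1:17, sep = ""))))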
DougB> In S/S-PLUS the data.dump/data.restore functions use a portable
DougB> representation that can be parsed without exponential memory growth.
hmm, yes, we have been longing for someone to write data.dump/data.restore
for R.
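Note that R already has dput()/dget(), which round-trip a parseable
ASCII representation, but dget() goes through the full R parser, so it
does not avoid the memory problem described above.  A small sketch:

    ## dput() writes a parseable ASCII form; dget() reads it back,
    ## but via parse(), so huge objects hit the same parser limits:
    acid <- data.frame(carb   = c(0.1, 0.3, 0.5),
                       optden = c(0.086, 0.269, 0.446))
    dput(acid, file = "acid-dump.R")
    acid2 <- dget("acid-dump.R")
    stopifnot(isTRUE(all.equal(acid, acid2)))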
Any volunteers?
--
Martin Maechler <maechler@stat.math.ethz.ch> <><
Seminar fuer Statistik, ETH-Zentrum SOL G1; Sonneggstr.33
ETH (Federal Inst. Technology) 8092 Zurich SWITZERLAND
phone: x-41-1-632-3408 fax: ...-1086
http://www.stat.math.ethz.ch/~maechler/
Martin Maechler <maechler@stat.math.ethz.ch> writes:
> On the other hand: does it really make sense to distribute huge
> example data sets as yours above?
The purpose of this example is to show that the lme methods work with
very large data sets.  These data are from a survey conducted by a
sociologist.  He fit a mixed-effects model to them using SAS PROC
MIXED; it took five hours of CPU time on a relatively fast machine
(Pentium II 233 MHz, 64 Mb memory).  Once I work out how to express
the model he used in our notation, I will try it in lme.  I am
confident we can do it much faster.
I have decided to omit this data set from the standard distribution of
lme, although, given the way data sets are organized in R, there is not
much penalty other than the disc space for including large examples
that are rarely used.
Following Thomas's suggestion of increasing the -n as well as the -v
option, I was able to read the data in its current form.
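For reference, that amounts to starting R with something like the
following (values are illustrative; -v enlarges the vector heap, -n
the number of cons cells):

    R -v 40 -n 1000000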
Douglas Bates bates@stat.wisc.edu
Statistics Department 608/262-2598
University of Wisconsin - Madison http://www.stat.wisc.edu/~bates/