"DougB" == Douglas Bates <bates@stat.wisc.edu> writes:
DougB> At present the example data sets in R libraries are to be given as
DougB> expressions that can be read directly into R. For example, the acid.R
DougB> file in the main library looks like
DougB> acid <- data.frame(
DougB>     carb   = c(0.1, 0.3, 0.5, 0.6, 0.7, 0.9),
DougB>     optden = c(0.086, 0.269, 0.446, 0.538, 0.626, 0.782),
DougB>     row.names = paste(1:6))
DougB> This is great when you have only a few observations. I have one
DougB> example data set with over 9000 rows and 17 variables.  Even when I
DougB> start R with -v 40, I exhaust the available memory trying to read it
DougB> in as a data.frame.  I believe this is because of the recursive
DougB> nature of the parsing of data objects.
yes;
DougB> Are there alternatives that would cause less memory usage?
yes; but only in the 0.62 development version.
The current 0.62 ``standard'' is:
  - if a 'data' file ends in .R,  source(.) is used to read it;
  - if it ends in .tab,           read.table(..., header = TRUE) is used.
(You find the new data(.) function in src/library/base/data in R-snapshot.)
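In other words, data(.) simply dispatches on the file extension.  A
minimal sketch of that convention, written in current R with a made-up
helper name (this is NOT the actual data() source, just an illustration):

    ## Hypothetical illustration of the extension-based dispatch:
    readDataFile <- function(path, envir = .GlobalEnv) {
        if (grepl("\\.R$", path)) {
            ## a .R file carries parsed R expressions; the assignments
            ## happen inside the file itself
            sys.source(path, envir = envir)
        } else if (grepl("\\.tab$", path)) {
            ## a .tab file is tabular text; name the object after the file
            name <- sub("\\.tab$", "", basename(path))
            assign(name, read.table(path, header = TRUE), envir = envir)
        } else stop("unrecognized data file: ", path)
    }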
Note that this is still not really satisfactory for large data files,
since read.table(.) is not very efficient:
it first reads everything into a character matrix and then converts
variable by variable, some columns to numeric, some to factor.
On the other hand: does it really make sense to distribute huge example
data sets as yours above?
If yes, AND if you have only numeric data,
I'd propose the following:
1) create a <pkg>/data/dougBex.R file which only contains something
   like the following (a filled-in instance follows after the list):

       dougBex <- as.data.frame(
           matrix(scan(system.file("<pkg>/data/dougBex.dat")),
                  ncol = ..., byrow = TRUE,  # scan() reads row by row
                  dimnames = ...))

2) create <pkg>/data/dougBex.dat to contain all your data, white-space
   delimited, numeric.
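For concreteness, with a 9000-row, 17-variable numeric data set such as
yours, the .R file could look like this (the column names here are
hypothetical):

    ## Hypothetical filled-in version for 17 numeric columns:
    dougBex <- as.data.frame(
        matrix(scan(system.file("<pkg>/data/dougBex.dat")),
               ncol = 17, byrow = TRUE,
               dimnames = list(NULL, paste("v", 1:17, sep = ""))))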
DougB> In S/S-PLUS the data.dump/data.restore functions use a portable
DougB> representation that can be parsed without exponential memory growth.
hmm, yes, we have been longing for someone to write data.dump/data.restore
for R.
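Note that R already has dput()/dget(), which round-trip a parseable
ASCII representation, but dget() goes through the full R parser, so it
does not avoid the memory problem described above.  A small sketch:

    ## dput() writes a parseable ASCII form; dget() reads it back,
    ## but via parse(), so huge objects hit the same parser limits:
    acid <- data.frame(carb   = c(0.1, 0.3, 0.5),
                       optden = c(0.086, 0.269, 0.446))
    dput(acid, file = "acid-dump.R")
    acid2 <- dget("acid-dump.R")
    stopifnot(isTRUE(all.equal(acid, acid2)))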
Any volunteers?
--
Martin Maechler <maechler@stat.math.ethz.ch> <><
Seminar fuer Statistik, ETH-Zentrum SOL G1; Sonneggstr.33
ETH (Federal Inst. Technology) 8092 Zurich SWITZERLAND
phone: x-41-1-632-3408 fax: ...-1086
http://www.stat.math.ethz.ch/~maechler/
Martin Maechler <maechler@stat.math.ethz.ch> writes:
> On the other hand: does it really make sense to distribute huge
> example data sets as yours above?
The purpose of this example is to show that the lme methods work with
very large data sets.  These data are from a survey conducted by a
sociologist.  He fit a mixed-effects model to them using SAS PROC
MIXED; it took five hours of CPU time on a relatively fast machine
(Pentium II 233 MHz, 64 Mb memory).  Once I work out how to express
the model he used in our notation, I will try it in lme.  I am
confident we can do it much faster.
I have decided to omit this data set from the standard distribution of
lme, although, given the way data sets are organized in R, there is not
much penalty other than the disc space for including large examples
that are rarely used.
Following Thomas's suggestion of increasing the -n as well as the -v
option, I was able to read the data in its current form.
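For reference, that amounts to starting R with something like the
following (values are illustrative; -v enlarges the vector heap, -n
the number of cons cells):

    R -v 40 -n 1000000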
Douglas Bates bates@stat.wisc.edu
Statistics Department 608/262-2598
University of Wisconsin - Madison http://www.stat.wisc.edu/~bates/