
Very Large Data Sets

6 messages · Tony Fagan, Peter Malewski, kmself@ix.netcom.com +3 more

#
Tony Fagan wrote:

1) you'll need plenty of memory
2) even then the computation time will be long

In the past I used SPSS to create a summarized data file (mostly there is much more data than is really needed). Now I cut the data down to a few records, write the syntax, and then run the code overnight.

I think the big plus of R is the flexibility of the analyses, not the preparation of very, very large databases.

Merry Xmas & a happy new year

Peter

--
** To YOU I'm an atheist; to God, I'm the Loyal Opposition. Woody Allen **
P.Malewski                                      Tel.: 0531 500965
Maschplatz 8                                    Email: P.Malewski at tu-bs.de
************************38114 Braunschweig********************************



-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
#
There are several components to this answer.  I'm not too well versed in
R, but I've run across the capacity question before.

R has a hard limit of 2 GB total memory, as I understand, and its data
model requires holding an entire set in memory.  This is very fast until
it isn't.  This limit applies even on 64 bit systems.

SAS can "process" a practically infinite data stream, one observation at
a time (or more accurately, one read buffer at a time).  You can
approach this ideal using multiple-volume tape input on a number of OSs.
However, this ability is limited to simple and straightforward
processing -- DATA step and some very simple procedures.

Processing limits for various operations in SAS vary by OS, SAS version,
and operation.  For 32 bit OSs under releases up through 6.8 - 6.12, 2
GB RAM, 2 GB disk, and 32,767 (2^15 - 1) of many things were hard
limits.   For various reasons, the hard limits don't apply in all cases,
and workarounds were provided in several areas.

Under 64 bit OSs, these limits tend to be lifted, though occasionally 32
bit biases sneak through and bite you (there was one such bug in Proc SQL).  
Traditional limits such as the number of levels (and significant bytes
in character variables) treated by PROC FREQ have been greatly increased
in versions 7 and 8 of SAS.

Other limits are imposed more by the sheer size of problems.  Many SAS
statistical procedures are based on IML and are limited by memory and
set size.  Even when large memory sets are supported, complex problems
with many levels may still exceed the capacity of any system.  Moreover,
complex statistics may make little sense on such large datasets.


When dealing with large datasets outside of SAS, my suggestion would be
to look to tools such as Perl and MySQL to handle the procedural and
relational processing of data, using R as an analytic tool.  Most simple
statistics (subsetting, aggregation, drilldown) can be accommodated
through these sorts of tools.  Think of the division of labor with R as
analogous to that between the DATA step and SAS/STAT or SAS/GRAPH.
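To make the division of labor concrete, here is a minimal sketch in Python, using the built-in sqlite3 module as a stand-in for MySQL; the table and column names are invented for illustration, and the aggregate would then be handed to R for the real statistics:

```python
import sqlite3

def aggregate_by_group(rows):
    """Load (group, value) rows into an in-memory table and return
    per-group counts and means -- the kind of subsetting/aggregation
    work best left to the SQL backend rather than the analytic tool."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE obs (grp TEXT, val REAL)")
    con.executemany("INSERT INTO obs VALUES (?, ?)", rows)
    cur = con.execute(
        "SELECT grp, COUNT(*), AVG(val) FROM obs GROUP BY grp ORDER BY grp"
    )
    result = cur.fetchall()
    con.close()
    return result

# Tiny invented example: two groups, three observations.
summary = aggregate_by_group([("a", 1.0), ("a", 3.0), ("b", 2.0)])
```

Only the small `summary` table would ever need to cross into R's in-memory world.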

I would be interested to know of any data cube tools which are freely
available or available as free software.
On Wed, Dec 22, 1999 at 10:38:30PM -0700, Tony Fagan wrote:

#
kmself at ix.netcom.com writes:
I am considering this type of approach for an application that will
involve very large data sets.  I will probably use the Python
scripting language rather than Perl, but the general approach is as
you describe.

We currently have some code in R packages to read Stata data files and
(in the "foreign" package of the src/contrib/Devel section on CRAN) to
read SAS XPORT format data libraries.  These packages can help to move
data from one format to another, but they don't help with handling
massive numbers of records in R's memory-based model.

When faced with a large data set I first want to determine the
representation of the data and some basic summaries.  After that I
might want to work with a subset of the rows and/or columns when doing
some modeling and only use the entire data set to refine or confirm
the model.

My idea is to take standard data formats for data tables (SAS XPORT
format, SPSS sav files, ...), encapsulate them as python classes, and
provide methods that would summarize the columns and perhaps emulate
Martin Maechler's excellent "str" function from R.  For example, I
would want to know if every value of a numeric variable happened to be
an integer and always in the range from 1 up to 10, or something like
that.  This would indicate to me that it was probably a coding of a
factor and not a numeric variable.  The summary methods should only
require calculations that can be done a row at a time.  Thus
calculating minima and maxima is reasonable but getting medians and
quartiles is not.  The classes would not import all the data into
python - they would simply keep around enough information to read the
data a row at a time on demand.
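A one-pass summary of that sort can be sketched in a few lines of Python; the class name and the factor heuristic threshold below are my own invention for illustration, not part of any existing package:

```python
class ColumnSummary:
    """One-pass summary of a numeric column: only statistics that can
    be updated a row at a time (min, max, mean -- not medians)."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.minimum = None
        self.maximum = None
        self.all_integer = True

    def update(self, value):
        """Fold one observation into the running summary."""
        self.n += 1
        self.total += value
        if self.minimum is None or value < self.minimum:
            self.minimum = value
        if self.maximum is None or value > self.maximum:
            self.maximum = value
        if value != int(value):
            self.all_integer = False

    def looks_like_factor(self, max_levels=10):
        # Heuristic from the text: every value an integer, always in
        # the range 1..max_levels, suggests a coded factor rather
        # than a genuinely numeric variable.
        return (self.all_integer and self.minimum is not None
                and self.minimum >= 1 and self.maximum <= max_levels)

# Feed a column row by row, as the class would on demand.
s = ColumnSummary()
for v in (1, 2, 3, 2):
    s.update(v)

t = ColumnSummary()
t.update(2.5)   # a non-integer value rules out the factor heuristic
```

The key property is that `update` never needs more than the current row, so the data set can be arbitrarily large.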

Each of these classes would include a method to store the data as a
table in a relational database system.  There are python packages for
most common SQL databases, including the freely available PostgreSQL
and MySQL.  The inverse transformation, from SQL table to proprietary data
format, would also be provided.

To work on a subset of the data within R we could try to enhance the
functions that read data from foreign formats to allow selection of
rows or columns.  However, as you suggest, that job is probably best
handled using SQL and functions within R that extract tables or views
from the SQL database.  I would note that Timothy Keitt has just
contributed an R package to interface with PostgreSQL.

Trying to write this type of code teaches you interesting things.  As
far as I can tell, you cannot discover the number of rows in a SAS
dataset from the header information.  The number of rows is not
recorded there and, because more than one dataset can be stored in a
library file, you cannot use the length of the file and the length of
the record to calculate the number of records (rows).  If you want to
allocate storage to hold the data in R, the simplest thing to do is
to read the file once to discover the structure and then read
it again to import the data.  It shows you that the idea that your
data are sitting on a reel of tape over at the "computing center" is
wired into the SAS system at a very low level.
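The read-twice pattern can be sketched in Python; a CSV file stands in here for a SAS XPORT library, since the point is only the read-once-for-structure, read-again-for-data shape of the loop:

```python
import csv
import io

def read_in_two_passes(open_file):
    """First pass: count the records, since the header doesn't say.
    Second pass: allocate storage of the right size and fill it."""
    reader = csv.reader(open_file())
    header = next(reader)
    n_rows = sum(1 for _ in reader)           # pass 1: structure only
    data = {name: [None] * n_rows for name in header}
    reader = csv.reader(open_file())          # pass 2: re-open and fill
    next(reader)                              # skip the header again
    for i, row in enumerate(reader):
        for name, value in zip(header, row):
            data[name][i] = value
    return n_rows, data

# A tiny in-memory "file"; open_file re-opens it for each pass.
text = "x,y\n1,a\n2,b\n3,c\n"
n, cols = read_in_two_passes(lambda: io.StringIO(text))
```

With a real file on disk the lambda would simply re-open the path, mimicking a tape rewind between passes.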
#
On Thu, 23 Dec 1999 kmself at ix.netcom.com wrote:
The S-PLUS package for the netCDF format, written by Steve Oncley of NCAR,
allows reading of arbitrary "slabs" of a very large data file. At one
point he was planning to write an R version, but I can't remember what
happened and my email records for the relevant time were eaten by a
Microsoft Outlook/Pine disagreement. 

This would allow you to work with large data files one piece at a time (if
they were netCDF files). Something similar could be done with mmap(2) if
your OS allows addressing that much memory (which they mostly will soon). 
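A rough illustration of slab access via memory mapping, in Python's mmap module; the scratch file of doubles and the slab indices are invented for the example:

```python
import mmap
import os
import struct
import tempfile

# Write a scratch file of 1000 little-endian doubles (0.0 .. 999.0).
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<1000d", *range(1000)))

# Map the file and pull out an arbitrary "slab" -- elements 100..104 --
# without ever reading the whole file into memory.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    start, count = 100, 5
    slab = struct.unpack_from("<%dd" % count, mm, start * 8)
    mm.close()
```

The OS pages in only the parts of the file actually touched, which is exactly the behaviour wanted for very large data files.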

Thomas Lumley
Assistant Professor, Biostatistics
University of Washington, Seattle

#
On Wed, 22 Dec 1999, Tony Fagan wrote:

There have been a couple of posts about approaching this large-dataset
problem with the MySQL/Python/R combination. I will simply add some
information (a testimonial) about my experiences with this as a possible
solution. This combination has worked very, very well for me. As a former
SAS and Windows user, I decided to perform my dissertation data analyses
using FreeBSD, which does not run SAS. After about a year of tinkering
around with different ways to approach the problem of analyzing my
dissertation data (i.e., moderately large ~1.5 million obs of
psychophysiological data), I have settled on this MySQL/Python/R
combination. In order to get to this stage, I looked into several other
solutions (e.g., Perl Data Language, PostgreSQL, Ox, APL, Perl, etc.), but
this combination met my needs best. 

For my purposes, I find this solution to be better than any other 
(including SAS). MySQL is very, very fast, especially when using
an index. Just last night, I could not believe how quickly it created
an R dataset for me (only 30 seconds on a slow 66 MHz 486DX, for a
complex join of four tables, each table containing about
500K rows). For most data-analytic purposes, I go directly from (1)
subsetting the data in MySQL to (2) performing more sophisticated data
analyses in R. For some more complex queries, the Python
link is needed, but not for most (Python, of course, is useful for
many reasons besides linking MySQL to R).

For my dissertation data, there is no reason for me to analyze all 1.5 
million rows at once. Rather, I need to perform the same statistical procedures,
one or two subjects at a time (i.e., 2400 rows), over and over again. I
let the SQL backend do the large, number-crunching work and let R shine
for statistics, and it really does shine...
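The subset-in-SQL, analyze-per-subject loop might look like the following in Python, with sqlite3 standing in for MySQL and made-up table contents; in practice the per-subject statistic would be computed in R rather than inline:

```python
import sqlite3

# Invented stand-in for the psychophysiological table: two subjects,
# three readings each.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE physio (subject INTEGER, reading REAL)")
con.executemany(
    "INSERT INTO physio VALUES (?, ?)",
    [(s, float(s * 10 + i)) for s in (1, 2) for i in range(3)],
)

# Let the SQL backend do the heavy subsetting; analyze one small
# subset at a time instead of all rows at once.
means = {}
for (subject,) in con.execute(
        "SELECT DISTINCT subject FROM physio ORDER BY subject"):
    values = [r[0] for r in con.execute(
        "SELECT reading FROM physio WHERE subject = ?", (subject,))]
    means[subject] = sum(values) / len(values)
con.close()
```

An index on `subject` is what makes each of these per-subject pulls fast even over millions of rows.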

Testimonially yours,

Loren




-------------------------------

Loren Michael McCarter
Graduate Student-UC Berkeley


