Very Large Data Sets
6 messages · Tony Fagan, Peter Malewski, kmself@ix.netcom.com +3 more
Tony Fagan wrote:
List,

Can R handle very large data sets (say, 100 million records) for data mining applications? My understanding is that Splus can not, but SAS can easily.

Thanks,
Tony Fagan
From a theoretical point of view yes, but practically:
1) you'll need plenty of memory
2) even then the computation time will be long

In the past I used SPSS to create a summarized data file (mostly there is much more data than is really needed). Now I cut the data down to a few records, write the syntax, and then run the code overnight. I think the ++ of R is the flexibility of the analyses, not the data preparation of very, very large databases.

Merry Xmas & a happy new year,
Peter

--
** To YOU I'm an atheist; to God, I'm the Loyal Opposition. Woody Allen **
P.Malewski          Tel.: 0531 500965
Maschplatz 8        Email: P.Malewski at tu-bs.de
38114 Braunschweig

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe" (in the "body", not the subject !)
To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
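Peter's "summarize first, analyze later" workflow can be sketched in a few lines: stream the large file once, keep only per-group aggregates, and hand the small summary file to R. This is only an illustrative sketch; the file and column names are hypothetical, and Python stands in for the SPSS syntax he describes.

```python
# Sketch of the "summarize before analysis" workflow Peter describes.
# File and column names are hypothetical.
import csv
from collections import defaultdict

def summarize(in_path, out_path, group_col, value_col):
    """Stream a large CSV once, writing per-group count/mean/min/max."""
    counts = defaultdict(int)
    sums = defaultdict(float)
    mins, maxs = {}, {}
    with open(in_path, newline="") as f:
        for row in csv.DictReader(f):
            g, v = row[group_col], float(row[value_col])
            counts[g] += 1
            sums[g] += v
            mins[g] = v if g not in mins else min(mins[g], v)
            maxs[g] = v if g not in maxs else max(maxs[g], v)
    # Only this small summary ever needs to fit in R's memory.
    with open(out_path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow([group_col, "n", "mean", "min", "max"])
        for g in sorted(counts):
            w.writerow([g, counts[g], sums[g] / counts[g], mins[g], maxs[g]])
```

The memory needed is proportional to the number of groups, not the number of records, which is what makes an overnight run over 100 million rows feasible.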
There are several components to this answer. I'm not too well versed in R, but I've run across the capacity question before.

R has a hard limit of 2 GB total memory, as I understand, and its data model requires holding an entire set in memory. This is very fast until it isn't. This limit applies even on 64-bit systems.

SAS can "process" a practically infinite data stream, one observation at a time (or, more accurately, one read buffer at a time). You can approach this ideal using multiple-volume tape input on a number of OSs. However, this ability is limited to simple and straightforward processing: the DATA step and some very simple procedures.

Processing limits for various operations in SAS vary by OS, SAS version, and operation. For 32-bit OSs under releases up through 6.8 - 6.12, 2 GB RAM, 2 GB disk, and 32,767 (2^15 - 1) of many things were hard limits. For various reasons, the hard limits don't apply in all cases, and workarounds were provided in several areas. Under 64-bit OSs, these limits tend to be lifted, though occasionally 32-bit biases sneak through and bite you (there was one such bug in Proc SQL). Traditional limits such as the number of levels (and significant bytes in character variables) treated by PROC FREQ have been greatly increased in versions 7 and 8 of SAS.

Other limits are imposed more by the sheer size of problems. Many SAS statistical procedures are based on IML and are limited by memory and set size. Even when large memory sets are supported, complex problems with many levels may still exceed the capacity of any system. Moreover, complex statistics may make little sense on such large datasets.

When dealing with large datasets outside of SAS, my suggestion would be to look to tools such as Perl and MySQL to handle the procedural and relational processing of data, using R as an analytic tool. Most simple statistics (subsetting, aggregation, drilldown) can be accommodated through these sorts of tools.
Think of the relationship to R as the division between the DATA step and SAS/STAT or SAS/GRAPH. I would be interested to know of any data cube tools which are freely available or available as free software.
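The division of labour suggested above can be sketched concretely: let a relational database do the subsetting and aggregation, and hand only the small result across to the analysis environment. This is a minimal sketch; sqlite3 stands in for MySQL so the example is self-contained, and the table and column names are made up.

```python
# Sketch of the suggested split: SQL does the procedural/relational
# work ("DATA step"); R would do the statistics ("SAS/STAT").
# sqlite3 is used here in place of MySQL purely for self-containment;
# the table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (subject TEXT, value REAL)")
conn.executemany("INSERT INTO obs VALUES (?, ?)",
                 [("s1", 1.0), ("s1", 3.0), ("s2", 5.0)])

# Subsetting, aggregation, drilldown all happen inside the database;
# only the aggregate crosses over to the statistics package.
summary = conn.execute(
    "SELECT subject, COUNT(*), AVG(value) "
    "FROM obs GROUP BY subject ORDER BY subject"
).fetchall()
```

With 100 million rows, `summary` is still only a handful of rows per group, which is all R would ever need to hold in memory.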
On Wed, Dec 22, 1999 at 10:38:30PM -0700, Tony Fagan wrote:
List, Can R handle very large data sets (say, 100 million records) for data mining applications? My understanding is that Splus can not, but SAS can easily. Thanks, Tony Fagan
Karsten M. Self (kmself at ix.netcom.com)
What part of "Gestalt" don't you understand?
SAS for Linux: http://www.netcom.com/~kmself/SAS/SAS4Linux.html
Mailing list: "subscribe sas-linux" to mailto:majordomo at cranfield.ac.uk
kmself at ix.netcom.com writes:
When dealing with large datasets outside of SAS, my suggestion would be to look to tools such as Perl and MySQL to handle the procedural and relational processing of data, using R as an analytic tool. Most simple statistics (subsetting, aggregation, drilldown) can be accommodated through these sorts of tools. Think of the relationship to R as the division between the DATA step and SAS/STAT or SAS/GRAPH.
I would be interested to know of any data cube tools which are freely available or available as free software.
I am considering this type of approach for an application that will involve very large data sets. I will probably use the Python scripting language rather than Perl, but the general approach is as you describe.

We currently have some code in R packages to read Stata data files and (in the "foreign" package of the src/contrib/Devel section on CRAN) to read SAS XPORT format data libraries. These packages can help to move data from one format to another, but they don't help with dealing with massive numbers of records in R's memory-based model.

When faced with a large data set, I first want to determine the representation of the data and some basic summaries. After that I might want to work with a subset of the rows and/or columns when doing some modeling, and only use the entire data set to refine or confirm the model.

My idea is to take standard data formats for data tables (SAS XPORT format, SPSS sav files, ...), encapsulate them as Python classes, and provide methods that would summarize the columns and perhaps emulate Martin Maechler's excellent "str" function from R. For example, I would want to know if every value of a numeric variable happened to be an integer and always in the range from 1 up to 10, or something like that. This would indicate to me that it was probably a coding of a factor and not a numeric variable. The summary methods should only require calculations that can be done a row at a time; thus calculating minima and maxima is reasonable but getting medians and quartiles is not. The classes would not import all the data into Python - they would simply keep around enough information to read the data a row at a time on demand.

Each of these classes would include a method to store the data as a table in a relational database system. There are Python packages for most common SQL databases, including the freely available PostgreSQL and MySQL. The inverse transformation, SQL table to proprietary data format, would also be provided.
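The row-at-a-time summary idea above might look something like the following sketch: one streaming pass that tracks only minima, maxima, and counts (never medians), plus the proposed check for integer-valued variables in a small range that probably code a factor. The class and function names are hypothetical, and CSV stands in for the XPORT/sav readers being described.

```python
# Minimal sketch of the proposed row-at-a-time column summary:
# one pass, min/max/count only, plus a factor-coding heuristic.
# Names are hypothetical; CSV stands in for SAS XPORT / SPSS sav input.
import csv

class ColumnSummary:
    def __init__(self):
        self.n = 0
        self.min = self.max = None
        self.all_integer = True

    def update(self, value):
        v = float(value)
        self.n += 1
        self.min = v if self.min is None else min(self.min, v)
        self.max = v if self.max is None else max(self.max, v)
        if v != int(v):
            self.all_integer = False

    def looks_like_factor(self, max_levels=10):
        """Every value an integer in 1..max_levels: probably a coded factor."""
        return (self.all_integer and self.min is not None
                and self.min >= 1 and self.max <= max_levels)

def summarize_csv(path):
    """Read the file a row at a time; the data are never held in memory."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        summaries = {name: ColumnSummary() for name in reader.fieldnames}
        for row in reader:
            for name, value in row.items():
                summaries[name].update(value)
    return summaries
```

Because `update` sees one value at a time, this works unchanged whether the file has a thousand rows or a hundred million.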
To work on a subset of the data within R we could try to enhance the functions that read data from foreign formats to allow selection of rows or columns. However, as you suggest, that job is probably best handled using SQL and functions within R that extract tables or views from the SQL database. I would note that Timothy Keitt has just contributed an R package to interface with PostgreSQL.

Trying to write this type of code teaches you interesting things. As far as I can tell, you cannot discover the number of rows in a SAS dataset from the header information. The number of rows is not recorded there and, because more than one dataset can be stored in a library file, you cannot use the length of the file and the length of the record to calculate the number of records (rows). If you want to allocate storage to hold the data in R, the simplest thing to do is to read the file once to discover the structure and then read it again to import the data. It shows you that the idea that your data are sitting on a reel of tape over at the "computing center" is wired into the SAS system at a very low level.
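The two-pass strategy described above (read once to learn the structure, again to fill preallocated storage) can be sketched generically. The fixed-length binary record layout here is invented for illustration; it is not the real XPORT format, whose header likewise does not record the row count.

```python
# Sketch of the two-pass read forced by a header with no row count.
# The record layout (<di: one double, one int) is made up for the
# example and is not the real SAS XPORT layout.
import io
import struct

RECORD = struct.Struct("<di")

def count_records(stream):
    """First pass: count fixed-length records without storing them."""
    n = 0
    while stream.read(RECORD.size):
        n += 1
    return n

def read_records(stream, n):
    """Second pass: storage can now be preallocated to exactly n rows."""
    rows = [None] * n
    for i in range(n):
        rows[i] = RECORD.unpack(stream.read(RECORD.size))
    return rows
```

With a tape drive this double read was expensive, which is presumably why SAS streams one observation at a time instead of ever asking "how many rows?" up front.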
On Thu, 23 Dec 1999 kmself at ix.netcom.com wrote:
When dealing with large datasets outside of SAS, my suggestion would be to look to tools such as Perl and MySQL to handle the procedural and relational processing of data, using R as an analytic tool. Most simple statistics (subsetting, aggregation, drilldown) can be accommodated through these sorts of tools. Think of the relationship to R as the division as between the DATA step and SAS/STAT or SAS/GRAPH. I would be interested to know of any data cube tools which are freely available or available as free software.
The S-PLUS package for the netCDF format, written by Steve Oncley of NCAR, allows reading of arbitrary "slabs" of a very large data file. At one point he was planning to write an R version, but I can't remember what happened, and my email records for the relevant time were eaten by a Microsoft Outlook/Pine disagreement.

This would allow you to work with large data files one piece at a time (if they were netCDF files). Something similar could be done with mmap(2) if your OS allows addressing that much memory (which they mostly will soon).

Thomas Lumley
Assistant Professor, Biostatistics
University of Washington, Seattle
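The mmap(2) idea can be shown in miniature: map a large binary file into the address space and decode only the "slab" of values you need, letting the OS page data in on demand. The flat array-of-doubles layout here is an assumption for the sketch; real netCDF files have their own header and variable layout.

```python
# Miniature of the mmap(2) approach: map the file, decode only a slab.
# Assumes a flat file of little-endian doubles (not real netCDF).
import mmap
import struct

def read_slab(path, start, count):
    """Return values[start:start+count] from a file of little-endian doubles."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Only the pages covering these bytes are actually read in.
            return list(struct.unpack_from("<%dd" % count, mm, start * 8))
```

The mapped file can be far larger than physical RAM; only the pages backing the requested slab are touched, which is exactly the "one piece at a time" access pattern described above.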
On Wed, 22 Dec 1999, Tony Fagan wrote:
List, Can R handle very large data sets (say, 100 million records) for data mining applications? My understanding is that Splus can not, but SAS can easily. Thanks, Tony Fagan
There have been a couple of posts about approaching this large-dataset problem with the MySQL/Python/R combination. I will simply add some information (a testimonial) about my experiences with this as a possible solution. This combination has worked very, very well for me.

As a former SAS and Windows user, I decided to perform my dissertation data analyses using FreeBSD, which does not run SAS. After about a year of tinkering with different ways to approach the problem of analyzing my dissertation data (i.e., moderately large: ~1.5 million obs of psychophysiological data), I settled on this MySQL/Python/R combination. To get to this stage, I looked into several other solutions (e.g., Perl Data Language, PostgreSQL, Ox, APL, Perl, etc.), but this combination met my needs best. For my purposes, I find this solution to be better than any other (including SAS).

MySQL is very, very fast, especially when using an index. Just last night, I could not believe how quickly it created an R dataset for me (only 30 seconds on a slow machine---486DX 66Mhz---for a complex join of four tables, each table containing about 500K rows).

For most data-analytic purposes, I go directly from (1) subsetting the data in MySQL to (2) performing more sophisticated data analyses in R. For some more complex queries, the Python link is needed, but not for most (Python, of course, is useful for many reasons other than linking MySQL to R). For my dissertation data, there is no reason for me to analyze all 1.5 million rows at once. Rather, I need to perform the same statistical procedures, one or two subjects at a time (i.e., 2400 rows), over and over again. I let the SQL backend do the large number-crunching work and let R shine for statistics, and it really does shine...
Testimonially yours,
Loren

-------------------------------
Loren Michael McCarter
Graduate Student - UC Berkeley