R vs. S-PLUS vs. SAS
On Sat, Dec 04, 2004 at 07:15:40AM -0500, Andrew Piskorski wrote:
On Fri, Dec 03, 2004 at 06:37:15PM +0000, Patrick Burns wrote:
There may be some differences between SAS procedures, but in general SAS does not require the whole dataset to be in RAM. Regression will take the data row by row and incrementally update the answer.
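That row-by-row updating can be sketched in R itself: for least squares one only needs the accumulated cross-products X'X and X'y, so a single row at a time suffices. This is a minimal illustration of the idea (the function names are my own invention, and it is not a claim about how SAS is actually implemented):

```r
## Incremental (row-by-row) least squares: accumulate X'X and X'y
## so only one observation needs to be in memory at a time.
incr_init <- function(p) list(XtX = matrix(0, p, p), Xty = numeric(p))
incr_update <- function(state, x, y) {
  state$XtX <- state$XtX + tcrossprod(x)   # adds x %*% t(x)
  state$Xty <- state$Xty + x * y
  state
}
incr_coef <- function(state) solve(state$XtX, state$Xty)

## Check against lm() on a small in-memory example
set.seed(1)
X <- cbind(1, rnorm(100))
y <- 2 + 3 * X[, 2] + rnorm(100)
s <- incr_init(2)
for (i in seq_len(nrow(X))) s <- incr_update(s, X[i, ], y[i])
all.equal(unname(incr_coef(s)), unname(coef(lm(y ~ X[, 2]))))  # TRUE
```

The point is that the pass over the data is streaming; only the small p-by-p state persists between rows. (Solving the normal equations this way is less numerically stable than lm()'s QR decomposition, but it shows the single-pass principle.)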
Someone might want to ask Joe Conway about his experience and thoughts integrating R as a procedural language inside PostgreSQL, to create PL/R: http://www.joeconway.com/plr/ http://gborg.postgresql.org/project/plr/projdisplay.php (Hm, for good measure, I have Cc'd him on this email.) Obviously, an
Very good point, but you didn't CC Joe. Done now. Hi Joe :)
RDBMS like PostgreSQL is expert at dealing with data that doesn't fit into RAM. I've no idea whether PL/R does anything special to take advantage of that, or how feasible it would be to do so. Does anyone here know much about what makes R dependent on all data being in RAM, or of links to same? Is it just some centralized low-level bits, or do broad swaths of code and algorithms all depend on the in-RAM assumption?
Discount my $0.02 severely, as I don't really know what I am rambling about, but here it goes anyway since talk is so cheap: S implementations are from a 'workstation' design era, where data objects live in RAM. As Pat mentioned earlier in this thread, they used to be far less efficient than they are now; R has made huge leaps. Our friendly listmembers from Insightful may want to complement me here with factual data :)
How do SAS and other such systems avoid that? Do they do this better
SAS reflects its mainframe-age design, i.e. pass (efficiently) over huge amounts of data that could never have been held in memory anyway. The interactive/exploratory/graphical nature of S versus the batch/non-interactive/non-graphical nature of SAS follows relatively cleanly from that basic design premise.
or much more transparently than what an R user would do now manually? By "manually" I mean: query some fits-in-RAM amount of data out of an RDBMS (or other such on-disk store), analyze it, delete the data to free up RAM, and repeat. Could one, say, tie a lightweight high-performance RDBMS library like SQLite into R, and have R use it profitably to scale nicely on data that does not fit in RAM? In what way, if any, would this offer a substantial advantage over current manual R-plus-RDBMS practice?
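The "manual" query-analyze-free-repeat loop described above can be sketched with SQLite from R. This uses the DBI/RSQLite interface (whose exact function names postdate this thread; treat them as an assumption), with a trivially small table standing in for the big one:

```r
library(DBI)
## Build a small demo table in an in-memory SQLite database
## (in practice this would be a large on-disk database file)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "obs", data.frame(x = 1:10))

## Stream it back in fits-in-RAM chunks, keeping only a small
## accumulated state (here a running sum) between chunks
res <- dbSendQuery(con, "SELECT x FROM obs")
total <- 0
repeat {
  chunk <- dbFetch(res, n = 3)   # at most 3 rows in RAM at a time
  if (nrow(chunk) == 0) break
  total <- total + sum(chunk$x)
  rm(chunk)                      # free the slice before the next one
}
dbClearResult(res)
dbDisconnect(con)
total
```

The burden of choosing the chunk size and carrying state between chunks falls on the user, which is exactly the "manual" quality being asked about: a tighter R-RDBMS integration would presumably hide that loop.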
Fei Chen, a doctoral student of Brian Ripley's, gave a truly impressive presentation at DSC 2003 about out-of-memory work with R. I bugged Brian repeatedly about writeups on this, but apparently there are none. Fei is now a professional data miner working on truly gigantic data sets ... It can be done, but it requires surgery on the engine. For someone really committed, it may be worth digging up Fei Chen's dissertation. Might even be a market niche for Insightful to explore. Dirk
If you don't go with R now, you will someday. -- David Kane on r-sig-finance, 30 Nov 2004