
R's memory limitation and Hadoop

7 messages · Barry King, John McKown, Jeff Newmiller +4 more

#
Is there a way to get around R's memory-bound limitation by interfacing
with a Hadoop database or should I look at products like SAS or JMP to work
with data that has hundreds of thousands of records?  Any help is
appreciated.
#
On Tue, Sep 16, 2014 at 6:40 AM, Barry King <barry.king at qlx.com> wrote:
Please change your email to plain text only, per forum standards.

You might want to look at bigmemory.
http://cran.revolutionanalytics.com/web/packages/bigmemory/index.html
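A minimal sketch of what that can look like (the file name, column layout, and backing-file names below are made-up assumptions, not from the thread):

library(bigmemory)

## Stream a large, purely numeric CSV into a file-backed big.matrix,
## so the data live on disk rather than in R's heap.
x <- read.big.matrix("big_data.csv", header = TRUE, type = "double",
                     backingfile = "big_data.bin",
                     descriptorfile = "big_data.desc")
dim(x)        # dimensions are available without loading everything
mean(x[, 1])  # pull one column at a time into ordinary R vectors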
#
If you need to start your question with a false dichotomy, by all means choose the option you seem to have already chosen and stop trolling us.
If you actually want an answer here, try Googling on the topic first (is "R hadoop" so un-obvious?) and then phrase a specific question so someone has a chance to help you.
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.
On September 16, 2014 4:40:29 AM PDT, Barry King <barry.king at qlx.com> wrote:
#
Not sure trolling was intended here.

Anyway:

Yes, there are ways of working with very large datasets in R, using databases or otherwise. Check the CRAN task views (e.g., the High-Performance and Parallel Computing with R view).

SAS will for _some_ purposes be able to avoid overflowing RAM by using sequential file access. The biglm package is an example of using similar techniques in R. SAS is not (to my knowledge) able to do this invariably; some procedures may need to load the entire data set into RAM.
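A minimal sketch of that chunked, bounded-memory approach with biglm (mtcars stands in for a genuinely large data set; the chunking scheme is just for illustration):

library(biglm)

## Fit on the first chunk, then update with the remaining chunks,
## so only one chunk has to be in RAM at a time.
chunks <- split(mtcars, rep(1:4, length.out = nrow(mtcars)))
fit <- biglm(mpg ~ wt + hp, data = chunks[[1]])
for (ch in chunks[-1]) fit <- update(fit, ch)
summary(fit)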

JMP's data tables are limited by available RAM, just like R's are.

R does have somewhat inefficient memory strategies (e.g., model matrices may include multiple columns of binary dummy variables, each stored as a double taking 8 bytes per entry), so it may run out of memory sooner than other programs, but it is not as if the competition is free of RAM restrictions.

- Peter D.
On 16 Sep 2014, at 14:27, Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:

#
Hundreds of thousands of records usually fit into memory fine.

Hadley
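
A quick back-of-the-envelope check of that claim (the row and column counts are illustrative assumptions):

## 500,000 rows x 20 numeric columns at 8 bytes per value:
5e5 * 20 * 8 / 1024^2                          # about 76 Mb
## Or measure an actual object of that size:
d <- as.data.frame(matrix(rnorm(5e5 * 20), ncol = 20))
print(object.size(d), units = "Mb")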
On Tue, Sep 16, 2014 at 12:40 PM, Barry King <barry.king at qlx.com> wrote:

#
On 16/09/2014 13:56, peter dalgaard wrote:
Also 'hundreds of thousands of records' is not really very much: I have seen analyses of millions many times[*]. I have analysed a few billion with 0.3TB of RAM.

[*] I recall a student fitting a GLM with about 30 predictors to 1.5m records: at the time (ca. R 2.14) it did not fit in 4GB but did in 8GB.
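
For a rough sense of scale in that footnote (taking the 1.5 million rows and roughly 30 predictors as given, and ignoring factor expansion):

1.5e6 * 31 * 8 / 1024^3    # ~0.35 GB for one copy of the model matrix

glm() holds several working objects of comparable size (model frame, design matrix, the QR decomposition), and factor predictors widen the matrix further, which is how 4GB can be exhausted.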

#
You can easily run out of memory when a few of the variables are factors, each with many levels, and the user looks for interactions between them. This can happen by accident if your data was imported with read.table() and a variable meant to be numeric was read as a factor (or character). str(yourData) would tell you about this problem.
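
A small sketch of how fast that blows up (the level counts and sizes below are invented for illustration):

n  <- 1e4
f1 <- factor(sample(50, n, replace = TRUE))    # 50 levels
f2 <- factor(sample(40, n, replace = TRUE))    # 40 levels
mm <- model.matrix(~ f1 * f2)                  # 1 + 49 + 39 + 49*39 = 2000 columns
print(object.size(mm), units = "Mb")           # ~150 Mb for only 10,000 rows

Scale that up to a few hundred thousand rows and the same formula needs several gigabytes for the design matrix alone.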

Bill Dunlap
TIBCO Software
wdunlap tibco.com


On Tue, Sep 16, 2014 at 11:47 AM, Prof Brian Ripley
<ripley at stats.ox.ac.uk> wrote: