If my calculation is correct (very doubtful, sometimes), that's
> 1.7e9 * (300 * 8 + 50 * 4) / 1024^3
[1] 4116.446
or over 4 terabytes (the figure above is in gigabytes), just to store
the data in memory.
To sample rows and read that into R, Bert's suggestion of using connections,
perhaps along with seek() for skipping ahead, would be what I'd try. I had
tried to do such things in Python as a chance to learn that language,
but I found that, operationally, it's easier to maintain the project
by doing everything in one language, namely R, if possible.
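Something along these lines is what I have in mind -- an untested
sketch, with the file name and the helper function entirely made up,
and ignoring edge cases like hitting EOF or landing in a header line:

## Jump to a random byte offset, discard the (probably partial) line
## we land in, then keep the next complete line. Note this slightly
## favors lines that follow long lines.
sampleLines <- function(file, n) {
    sz <- file.info(file)$size
    con <- file(file, open = "r")
    on.exit(close(con))
    out <- character(n)
    for (i in 1:n) {
        seek(con, where = floor(runif(1, 0, sz)))
        readLines(con, n = 1)            # throw away the partial line
        out[i] <- readLines(con, n = 1)  # keep the next full line
    }
    out
}

## e.g., then parse the sampled lines:
## x <- read.table(textConnection(sampleLines("big.dat", 100000)))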
Andy
From: Berton Gunter
I think the general advice is that around 1/4 or 1/3 of your available
memory is about the largest data set that R can handle -- and often
considerably less, depending upon what you do and how you do it
(because R's semantics require explicitly copying objects rather than
passing pointers). Fancy tricks using environments might enable you to
do better, but that requires advice from a true guru, which I ain't.
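As a rough back-of-the-envelope illustration only (assuming 2600 bytes
per row, i.e. 300 doubles plus 50 integers), 1/4 of an 8 GB machine
works out to:

> (8 * 1024^3 / 4) / (300 * 8 + 50 * 4)
[1] 825955.2

so something on the order of 800,000 rows at a time, before any copies
are made.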
See ?connections, ?scan, and ?seek for reading in a file a chunk at a
time from a connection, thus enabling you to sample one line of data
from each chunk, say.
I suppose you could do this directly with repeated calls to scan() or
read.table() by skipping more and more lines at the beginning of each
call, but I assume that is horridly inefficient and would take forever.
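For the chunked approach, something like this untested sketch might do
(assuming whitespace-separated numeric data with 350 fields per line;
the file name and chunk size are placeholders):

con <- file("big.dat", open = "r")
kept <- list()
repeat {
    ## successive scan() calls on an open connection continue
    ## where the previous call stopped
    chunk <- scan(con, what = double(), nlines = 10000, quiet = TRUE)
    if (length(chunk) == 0) break
    m <- matrix(chunk, ncol = 350, byrow = TRUE)
    ## keep one randomly chosen row from this chunk
    kept[[length(kept) + 1]] <- m[sample(nrow(m), 1), ]
}
close(con)
samp <- do.call(rbind, kept)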
HTH.
-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA
"The business of the statistician is to catalyze the
scientific learning
process." - George E. P. Box
-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Weiwei Shi
Sent: Thursday, October 27, 2005 9:28 AM
To: r-help
Subject: [R] memory problem in handling large dataset
Dear Listers:
I have a question on handling a large dataset. I searched the list
archives, but I still hope I can get more information as to my
specific case.
First, my dataset has 1.7 billion observations and 350 variables,
among which 300 are floats and 50 are integers.
My system is a Linux box with 8 GB of memory and a 64-bit CPU.
(Currently, we don't plan to buy more memory.)
_
platform i686-redhat-linux-gnu
arch     i686
os       linux-gnu
system   i686, linux-gnu
status
major    2
minor    1.1
year     2005
month    06
day      20
language R
If I want to do some analysis, for example randomForest, on a dataset,
what is the maximum number of observations I can load and still have
the machine run smoothly?
After figuring out that number, I want to do some sampling on the
original dataset. I did not find that read.table or scan can do this.
I guess I can load it into MySQL and then use RMySQL to do the
sampling, or use some other tool to preprocess the data first. My
question is: is there a way I can subsample directly from the file
using just R?
Thanks,
--
Weiwei Shi, Ph.D
"Did you always know?"
"No, I did not. But I believed..."
---Matrix III