How to separate huge dataset into chunks
4 messages · Thomas Lumley, Guillaume Filteau
On Tue, 24 Mar 2009, Guillaume Filteau wrote:
Hello all, I'm trying to take a huge dataset (1.5 GB) and separate it into smaller chunks with R. So far I have had nothing but problems. I cannot load the whole dataset into R due to memory problems, so I instead tried to load a few (100,000) lines at a time with read.table. However, R kept crashing (with no error message) at about line 6,800,000. This is extremely frustrating. To try to fix this, I used connections with read.table. However, I now get a cryptic error telling me "no lines available in input". Is there any way to make this work?
There might be an error in line 42 of your script. Or somewhere else. The error message is cryptically saying that there were no lines of text available in the input connection, so presumably the connection wasn't pointed at your file correctly.
It's hard to guess without seeing what you are doing, but
conn <- file("mybigfile", open="r")
## read the first chunk (with the header row) and record the column names
chunk <- read.table(conn, header=TRUE, nrows=10000)
nms <- names(chunk)
while(nrow(chunk)==10000){
  chunk <- read.table(conn, nrows=10000, col.names=nms)
  ## do something to the chunk
}
close(conn)
should work. This sort of thing certainly does work routinely.
It's probably not worth reading 100,000 lines at a time unless your computer has a lot of memory. Reducing the chunk size to 10,000 shouldn't introduce much extra overhead and may well increase the speed by reducing memory use.
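Since the goal is just to split the file into smaller pieces, the "## do something to the chunk" step could simply write each chunk out to its own file. A rough sketch along those lines (the chunk_%03d.csv file names are only for illustration):

conn <- file("mybigfile", open="r")
chunk <- read.table(conn, header=TRUE, nrows=10000)
nms <- names(chunk)
i <- 1
while(nrow(chunk)==10000){
  ## write the current chunk to its own file: chunk_001.csv, chunk_002.csv, ...
  write.csv(chunk, sprintf("chunk_%03d.csv", i), row.names=FALSE)
  i <- i + 1
  chunk <- read.table(conn, nrows=10000, col.names=nms)
}
## the loop exits on a short read, so write out the final, shorter chunk too
write.csv(chunk, sprintf("chunk_%03d.csv", i), row.names=FALSE)
close(conn)

The one wrinkle is that if the number of data rows is an exact multiple of the chunk size, the last read.table call finds nothing left on the connection and stops with the same "no lines available in input" error.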
-thomas
Thomas Lumley Assoc. Professor, Biostatistics
tlumley at u.washington.edu University of Washington, Seattle
1 day later
Hello Thomas, Thanks for your help! Sadly your code does not work for the last chunk, because its length is shorter than nrows. I tried try(chunk <- read.table(conn, nrows=10000, col.names=nms), silent=TRUE), but it gives me an error (go figure!). Best, Guillaume
On Wed, 25 Mar 2009, Guillaume Filteau wrote:
Hello Thomas, Thanks for your help! Sadly your code does not work for the last chunk, because its length is shorter than nrows.
You just need to move the test to the bottom of the loop:
repeat{
  chunk <- read.table(conn, nrows=10000, col.names=nms)
  ## do something to the chunk
  if(nrow(chunk) < 10000) break
}
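If the number of data rows happens to be an exact multiple of the chunk size, that last read.table call will find an empty connection and stop with the "no lines available in input" error, so the try() idea can still earn its keep. A rough sketch of one way to combine it with the loop (this just reuses conn, nms, and the chunk size from the earlier code, and is not the only way to arrange it):

conn <- file("mybigfile", open="r")
chunk <- read.table(conn, header=TRUE, nrows=10000)
nms <- names(chunk)
repeat{
  ## do something to the chunk
  if(nrow(chunk) < 10000) break   # the last chunk was shorter than the chunk size
  chunk <- try(read.table(conn, nrows=10000, col.names=nms), silent=TRUE)
  if(inherits(chunk, "try-error")) break   # the file ended exactly on a chunk boundary
}
close(conn)

Arranged this way, every chunk, including the first one read to get the column names and the short final one, goes through the "do something" step exactly once.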
Thomas Lumley Assoc. Professor, Biostatistics
tlumley at u.washington.edu University of Washington, Seattle