
How to separate huge dataset into chunks

On Tue, 24 Mar 2009, Guillaume Filteau wrote:

There might be an error in line 42 of your script. Or somewhere else. The error message is cryptically saying that there were no lines of text available in the input connection, so presumably the connection wasn't pointed at your file correctly.
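For what it's worth, that message is what read.table() raises when its connection is already at end of file, so a minimal way to reproduce it (the file here is just a throwaway temp file made up for the demo) is:

```r
f <- tempfile()                          # throwaway demo file
writeLines(c("x", "1"), f)               # one header line, one data line
conn <- file(f, open = "r")
ok  <- read.table(conn, header = TRUE)   # consumes the whole file
err <- tryCatch(read.table(conn),        # nothing left on the connection
                error = function(e) conditionMessage(e))
close(conn)
err                                      # "no lines available in input"
```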

It's hard to guess without seeing what you are doing, but
    conn <- file("mybigfile", open="r")
    chunk <- read.table(conn, header=TRUE, nrows=10000)
    nms <- names(chunk)
    repeat {
       ## do something to the chunk
       if (nrow(chunk) < 10000) break  ## a short chunk means end of file
       ## nrow(), not length(): length() of a data frame is its column count.
       ## The tryCatch guards the read that hits end-of-file when the row
       ## count is an exact multiple of the chunk size.
       chunk <- tryCatch(read.table(conn, nrows=10000, col.names=nms),
                         error=function(e) chunk[0,])
       if (nrow(chunk) == 0) break
    }
    close(conn)

should work. This sort of thing certainly does work routinely.
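As a self-contained illustration of that pattern (the column names, the 12-row demo file, and the chunk size of 5 are all made up for the example):

```r
f <- tempfile()
write.table(data.frame(a = 1:12, b = (1:12)^2), f, row.names = FALSE)

conn <- file(f, open = "r")
chunk <- read.table(conn, header = TRUE, nrows = 5)
nms <- names(chunk)
total <- 0L
repeat {
    total <- total + nrow(chunk)          # stand-in for "do something"
    if (nrow(chunk) < 5) break            # short chunk: end of file
    chunk <- read.table(conn, nrows = 5, col.names = nms)
}
close(conn)
total                                     # 12: every row seen exactly once
```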

It's probably not worth reading 100,000 lines at a time unless your computer has a lot of memory. Reducing the chunk size to 10,000 shouldn't introduce much extra overhead and may well increase the speed by reducing memory use.
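One quick way to gauge what a chunk of a given size costs in memory (a sketch; the two numeric columns are a made-up example, and exact sizes vary slightly by platform):

```r
## 10,000 rows x 2 numeric columns is about 2 * 1e4 * 8 bytes of data,
## plus a little data-frame overhead
chunk <- data.frame(x = numeric(1e4), y = numeric(1e4))
print(object.size(chunk), units = "Kb")   # roughly 160 Kb
```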

     -thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle