read.delim very slow in reading files with lots of columns
Here is how long it took to read a file with 10 lines and 700,000 comma-separated columns per line:
system.time(input <- scan("/tempxx.txt", what=0, sep=','))
Read 7000000 items
   user  system elapsed
  15.62    0.22   15.84
object.size(input)
56000024 bytes
'scan' should be sufficient and it will not take another 10 minutes in awk.
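Note that scan() returns one flat vector, not a matrix. A minimal sketch of restoring the shape follows, assuming the file is purely numeric with no header line. The temporary file, its small dimensions, and the names path, n_row, n_col and mat are made up for illustration.

```r
# Hypothetical small stand-in for the 10 x 700,000 comma-separated file.
path <- tempfile(fileext = ".txt")
n_row <- 10
n_col <- 5
m <- matrix(as.numeric(seq_len(n_row * n_col)), nrow = n_row, byrow = TRUE)
write.table(m, path, sep = ",", row.names = FALSE, col.names = FALSE)

# scan() reads every field into one flat numeric vector, in file order.
input <- scan(path, what = 0, sep = ",", quiet = TRUE)

# The file is stored row by row, so fill the matrix by row to recover it.
mat <- matrix(input, nrow = n_row, byrow = TRUE)
```

Filling with byrow = TRUE matters here: matrix() fills column by column by default, which would scramble values read in row order.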
On Fri, Sep 25, 2009 at 1:17 PM, Charles C. Berry <cberry at tajo.ucsd.edu> wrote:
On Fri, 25 Sep 2009, Ping-Hsun Hsieh wrote:
Thanks, Ben. The matrix is purely numeric (6x700000, 31 MB). I tried colClasses='numeric' as well as nrows=7 (one of the lines is the header) on the matrix. I also tested read.delim() without setting either option.
A couple of things come to mind.

First, I have not read the internals of scan, but suspect that parsing a really long line may be slowing things down. Since you are attempting to read in a numeric matrix, you can simply do a global replacement of your delimiter with a newline and use scan on the result. On unix-like systems, something like

    tmp <- scan( pipe( 'tr "\t" "\n" < test_data.txt' ) )

ought to help.

Second, the memory occupied by each line - once it has been processed - is spread over the full 32MB (or 3.2 GB for the 600 by 700000 version) region of memory. I am guessing that this is causing your cache to work hard to put it in place. If you really want the result to be a 600 by 700000 matrix, you might try to read it in smaller blocks using scan( pipe( "cut ... " ) ) to feed selected blocks of columns of your text file to R.

HTH,
Chuck
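Both suggestions can be sketched as follows, assuming a unix-like system with tail, cut and tr on the path, and a tab-delimited file with a single header line. The helper name read_cols, the temporary file and its tiny dimensions are hypothetical; the original message elides the actual cut arguments, so the -f range used here is only one possible choice.

```r
# Hypothetical small stand-in for test_data.txt: 3 rows, 4 columns, one header.
path <- tempfile(fileext = ".txt")
m <- matrix(as.numeric(1:12), nrow = 3, byrow = TRUE)
write.table(m, path, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = paste0("c", 1:4))

# Suggestion 1: drop the header, turn each tab into a newline, and let
# scan() read one number per (short) line instead of one very long line.
cmd <- sprintf("tail -n +2 %s | tr '\\t' '\\n'", shQuote(path))
vals <- scan(pipe(cmd), what = numeric(), quiet = TRUE)
full <- matrix(vals, ncol = 4, byrow = TRUE)

# Suggestion 2: read only a block of columns, selected with cut
# (cut splits on tabs by default).
read_cols <- function(path, first, last) {
  cmd <- sprintf("tail -n +2 %s | cut -f%d-%d | tr '\\t' '\\n'",
                 shQuote(path), as.integer(first), as.integer(last))
  vals <- scan(pipe(cmd), what = numeric(), quiet = TRUE)
  matrix(vals, ncol = last - first + 1, byrow = TRUE)
}

block <- read_cols(path, 2, 3)
```

On the real file, the blocks returned by read_cols could be processed one at a time or joined with cbind(), depending on whether the full matrix is needed in memory at once.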
Here is the time spent on reading the matrix for each test.
system.time( tmp <- read.delim("test_data.txt"))
     user    system   elapsed
50985.421    27.665 51013.384
system.time(tmp <-
read.delim("test_data.txt",colClasses="numeric",nrows=7,comment.char=""))
     user    system   elapsed
51301.563    60.491 51362.208

It seems setting the options does not speed up the reading at all. Is it because of the header line? I will test it. Did I misunderstand something?

One additional and interesting observation: the run with the options set does save a lot of memory. It took ~150 MB, while the other took ~4 GB to read the matrix.

I will try scan() and see if it helps.

Thanks!
Mike

-----Original Message-----
From: Benilton Carvalho [mailto:bcarvalh at jhsph.edu]
Sent: Wednesday, September 23, 2009 4:56 PM
To: Ping-Hsun Hsieh
Cc: r-help at r-project.org
Subject: Re: [R] read.delim very slow in reading files with lots of columns

use the 'colClasses' argument and you can also set 'nrows'.

b

On Sep 23, 2009, at 8:24 PM, Ping-Hsun Hsieh wrote:
Hi,

I am trying to read a tab-delimited file into R (Ver. 2.8). The machine I am using is 64-bit Linux with 16 GB of memory. The file is basically a matrix (~600x700000) and is as large as 3 GB. read.delim() ran extremely slowly (hours), even on a subset of the file (31 MB, 6x700000). I monitored the memory usage and found that it consistently took less than 1% of the 16 GB. Does read.delim() have difficulty reading files with lots of columns? Any suggestions?

Thanks,
Mike
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?