read.table performance
Here is a test that I ran where the difference was rather the data was in a single column or 3700 columns. If in a single column, the 'scan' and 'read.table' were comparable; with 3700 columns, read.table took 3X longer. using 'colClasses' did not make a difference:
x.n <- as.character(runif(3700)) x.f <- tempfile() # just write out a file of numbers in a single column # 3700 * 500 = 1.85M lines writeLines(rep(x.n, 500), con = x.f) file.info(x.f)
size isdir mode mtime
C:\\Users\\Owner\\AppData\\Local\\Temp\\RtmpOWGkEu\\file60a82064
35154500 FALSE 666 2011-12-07 06:13:56
ctime atime
C:\\Users\\Owner\\AppData\\Local\\Temp\\RtmpOWGkEu\\file60a82064
2011-12-07 06:13:52 2011-12-07 06:13:52
exe
C:\\Users\\Owner\\AppData\\Local\\Temp\\RtmpOWGkEu\\file60a82064 no
system.time(x.n.read <- scan(x.f))
Read 1850000 items user system elapsed 4.04 0.05 4.10
dim(x.n.read)
NULL
object.size(x.n.read)
14800040 bytes
system.time(x.n.read <- read.table(x.f)) # comparible to 'scan'
user system elapsed 4.68 0.06 4.74
object.size(x.n.read)
14800672 bytes
# now create data with 3700 columns # and 500 rows (1.85M numbers) x.long <- paste(x.n, collapse = ',') writeLines(rep(x.long, 500), con = x.f) file.info(x.f)
size isdir mode mtime
C:\\Users\\Owner\\AppData\\Local\\Temp\\RtmpOWGkEu\\file60a82064
33305000 FALSE 666 2011-12-07 06:14:11
ctime atime
C:\\Users\\Owner\\AppData\\Local\\Temp\\RtmpOWGkEu\\file60a82064
2011-12-07 06:13:52 2011-12-07 06:13:52
C:\\Users\\Owner\\AppData\\Local\\Temp\\RtmpOWGkEu\\file60a82064 no
system.time(x.long.read <- scan(x.f, sep = ','))
Read 1850000 items user system elapsed 4.21 0.02 4.23
dim(x.long.read)
NULL
object.size(x.long.read)
14800040 bytes
# takes 3 times as long as 'scan' system.time(x.long.read <- read.table(x.f, sep = ','))
user system elapsed 13.24 0.06 13.33
dim(x.long.read)
[1] 500 3700
object.size(x.long.read)
15185368 bytes
# using colClasses system.time(x.long.read <- read.table(x.f, sep = ','
+ , colClasses = rep('numeric', 3700)
+ )
+ )
user system elapsed
12.39 0.06 12.48
On Tue, Dec 6, 2011 at 4:33 PM, Gene Leynes <gleynes at gmail.com> wrote:
Mark, Thanks for your suggestions. That's a good idea about the NULL columns; I didn't think of that. Surprisingly, it didn't have any effect on the time. This problem was just a curiosity, I already did the import using Excel and VBA. ?I was just going to illustrate the power and simplicity of R, but it ironically it's been much slower and harder in R... The VBA was painful and messy, and took me over an hour to write; but at least it worked quickly and reliably. The R code was clean and only took me about 5 minutes to write, but the run time was prohibitively slow! I profiled the code, but that offers little insight to me. Profile results with 10 line file:
summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
$by.self ? ? ? ? ? ? self.time self.pct total.time total.pct scan ? ? ? ? ? ? 12.24 ? ?53.50 ? ? ?12.24 ? ? 53.50 read.table ? ? ? 10.58 ? ?46.24 ? ? ?22.88 ? ?100.00 type.convert ? ? ?0.04 ? ? 0.17 ? ? ? 0.04 ? ? ?0.17 make.names ? ? ? ?0.02 ? ? 0.09 ? ? ? 0.02 ? ? ?0.09 $by.total ? ? ? ? ? ? total.time total.pct self.time self.pct read.table ? ? ? ?22.88 ? ?100.00 ? ? 10.58 ? ?46.24 scan ? ? ? ? ? ? ?12.24 ? ? 53.50 ? ? 12.24 ? ?53.50 type.convert ? ? ? 0.04 ? ? ?0.17 ? ? ?0.04 ? ? 0.17 make.names ? ? ? ? 0.02 ? ? ?0.09 ? ? ?0.02 ? ? 0.09 $sample.interval [1] 0.02 $sampling.time [1] 22.88 Profile results with 250 line file:
summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
$by.self ? ? ? ? ? ? self.time self.pct total.time total.pct scan ? ? ? ? ? ? 23.88 ? ?68.15 ? ? ?23.88 ? ? 68.15 read.table ? ? ? 10.78 ? ?30.76 ? ? ?35.04 ? ?100.00 type.convert ? ? ?0.30 ? ? 0.86 ? ? ? 0.32 ? ? ?0.91 character ? ? ? ? 0.02 ? ? 0.06 ? ? ? 0.02 ? ? ?0.06 file ? ? ? ? ? ? ?0.02 ? ? 0.06 ? ? ? 0.02 ? ? ?0.06 lapply ? ? ? ? ? ?0.02 ? ? 0.06 ? ? ? 0.02 ? ? ?0.06 unlist ? ? ? ? ? ?0.02 ? ? 0.06 ? ? ? 0.02 ? ? ?0.06 $by.total ? ? ? ? ? ? ? total.time total.pct self.time self.pct read.table ? ? ? ? ?35.04 ? ?100.00 ? ? 10.78 ? ?30.76 scan ? ? ? ? ? ? ? ?23.88 ? ? 68.15 ? ? 23.88 ? ?68.15 type.convert ? ? ? ? 0.32 ? ? ?0.91 ? ? ?0.30 ? ? 0.86 sapply ? ? ? ? ? ? ? 0.04 ? ? ?0.11 ? ? ?0.00 ? ? 0.00 character ? ? ? ? ? ?0.02 ? ? ?0.06 ? ? ?0.02 ? ? 0.06 file ? ? ? ? ? ? ? ? 0.02 ? ? ?0.06 ? ? ?0.02 ? ? 0.06 lapply ? ? ? ? ? ? ? 0.02 ? ? ?0.06 ? ? ?0.02 ? ? 0.06 unlist ? ? ? ? ? ? ? 0.02 ? ? ?0.06 ? ? ?0.02 ? ? 0.06 simplify2array ? ? ? 0.02 ? ? ?0.06 ? ? ?0.00 ? ? 0.00 $sample.interval [1] 0.02 $sampling.time [1] 35.04 On Tue, Dec 6, 2011 at 2:34 PM, Mark Leeds <markleeds2 at gmail.com> wrote:
hi gene: maybe someone else will reply with some ?subtleties that I'm not aware of. one other thing that might help: if you know which columns you want , you can set the others to NULL through colClasses and this should speed things up also. For example, say you knew you only wanted the first four columns and they were character. then you could do, read.table(whatever, as.is=TRUE, colClasses = c(rep(character,4), rep(NULL,3696)). hopefully someone else will say something that does the trick. it seems odd to me as far as the difference in timings ? good luck. On Tue, Dec 6, 2011 at 1:55 PM, Gene Leynes <gleynes at gmail.com> wrote:
Mark, Thank you for the reply I neglected to mention that I had already set options(stringsAsFactors=FALSE) I agree, skipping the factor determination can help performance. The main reason that I wanted to use read.table is because it will correctly determine the column classes for me. ?I don't really want to specify 3700 column classes! ?(I'm not sure what they are anyway). On Tue, Dec 6, 2011 at 12:40 PM, Mark Leeds <markleeds2 at gmail.com> wrote:
Hi Gene: Sometimes using colClasses in read.table can speed things up. If you know what your variables are ahead of time and what you want them to be, this allows you to be specific ?by specifying, character or numeric, etc ?and often it makes things faster. others will have more to say. also, if most of your variables are characters, R will try to turn convert them into factors by default. If you use as.is = TRUE it won't do this and that might speed things up also. Rejoinder: ?above tidbits are ?just from experience. I don't know if it's in stone or a hard and fast rule. On Tue, Dec 6, 2011 at 1:15 PM, Gene Leynes <gleynes at gmail.com> wrote:
** Disclaimer: I'm looking for general suggestions **
I'm sorry, but can't send out the file I'm using, so there is no
reproducible example.
I'm using read.table and it's taking over 30 seconds to read a tiny
file.
The strange thing is that it takes roughly the same amount of time if
the
file is 100 times larger.
After re-reviewing the data Import / Export manual I think the best
approach would be to use Python, or perhaps the readLines function, but
I
was hoping to understand why the simple read.table approach wasn't
working
as expected.
Some relevant facts:
? 1. There are about 3700 columns. ?Maybe this is the problem? ?Still
the
? file size is not very large.
? 2. The file encoding is ANSI, but I'm not specifying that in the
? function. ?Setting fileEncoding="ANSI" produces an "unsupported
conversion"
? error
? 3. readLines imports the lines quickly
? 4. scan imports the file quickly also
Obviously, scan and readLines would require more coding to identify
columns, etc.
my code:
system.time(dat <- read.table('C:/test.txt', nrows=-1, sep='\t',
header=TRUE))
It's taking 33.4 seconds and the file size is only 315 kb!
Thanks
Gene
? ? ? ?[[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
? ? ? ?[[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Jim Holtman Data Munger Guru What is the problem that you are trying to solve? Tell me what you want to do, not how you want to do it.