Message-ID: <CAAxdm-7P7Q_kTSgor_Aw1PW2OchgfrSDWy-6SA-046KVaNxUZA@mail.gmail.com>
Date: 2011-12-07T11:18:29Z
From: jim holtman
Subject: read.table performance
In-Reply-To: <CAOBARVj4qj86xQtSf32HW=oHaPC6HHbOptS_bkRcpj69T3uyxQ@mail.gmail.com>
Here is a test that I ran where the difference was whether the data was
in a single column or 3700 columns. With a single column, 'scan'
and 'read.table' were comparable; with 3700 columns, read.table took
3X longer. Using 'colClasses' did not make a difference:
> x.n <- as.character(runif(3700))
> x.f <- tempfile()
> # just write out a file of numbers in a single column
> # 3700 * 500 = 1.85M lines
> writeLines(rep(x.n, 500), con = x.f)
> file.info(x.f)
size isdir mode mtime
C:\\Users\\Owner\\AppData\\Local\\Temp\\RtmpOWGkEu\\file60a82064
35154500 FALSE 666 2011-12-07 06:13:56
ctime atime
C:\\Users\\Owner\\AppData\\Local\\Temp\\RtmpOWGkEu\\file60a82064
2011-12-07 06:13:52 2011-12-07 06:13:52
exe
C:\\Users\\Owner\\AppData\\Local\\Temp\\RtmpOWGkEu\\file60a82064 no
> system.time(x.n.read <- scan(x.f))
Read 1850000 items
user system elapsed
4.04 0.05 4.10
> dim(x.n.read)
NULL
> object.size(x.n.read)
14800040 bytes
> system.time(x.n.read <- read.table(x.f)) # comparable to 'scan'
user system elapsed
4.68 0.06 4.74
> object.size(x.n.read)
14800672 bytes
>
> # now create data with 3700 columns
> # and 500 rows (1.85M numbers)
> x.long <- paste(x.n, collapse = ',')
> writeLines(rep(x.long, 500), con = x.f)
> file.info(x.f)
size isdir mode mtime
C:\\Users\\Owner\\AppData\\Local\\Temp\\RtmpOWGkEu\\file60a82064
33305000 FALSE 666 2011-12-07 06:14:11
ctime atime
C:\\Users\\Owner\\AppData\\Local\\Temp\\RtmpOWGkEu\\file60a82064
2011-12-07 06:13:52 2011-12-07 06:13:52
exe
C:\\Users\\Owner\\AppData\\Local\\Temp\\RtmpOWGkEu\\file60a82064 no
> system.time(x.long.read <- scan(x.f, sep = ','))
Read 1850000 items
user system elapsed
4.21 0.02 4.23
> dim(x.long.read)
NULL
> object.size(x.long.read)
14800040 bytes
> # takes 3 times as long as 'scan'
> system.time(x.long.read <- read.table(x.f, sep = ','))
user system elapsed
13.24 0.06 13.33
> dim(x.long.read)
[1] 500 3700
> object.size(x.long.read)
15185368 bytes
>
>
> # using colClasses
> system.time(x.long.read <- read.table(x.f, sep = ','
+ , colClasses = rep('numeric', 3700)
+ )
+ )
user system elapsed
12.39 0.06 12.48
>
>
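Since 'scan' stays fast on the wide file, one possible workaround is to read
the values with 'scan' and reshape them yourself. The sketch below is only an
illustration under assumptions that are not from the original thread: every
field is numeric, every row has the same number of fields, and the helper name
read_wide_numeric is made up for this example.

```r
# Hypothetical helper (not part of the thread): read an all-numeric,
# delimited file with scan() and reshape the result into a data frame.
# Assumes every row has the same number of numeric fields.
read_wide_numeric <- function(path, sep = ",", header = FALSE) {
  # Split the first line to learn how many columns there are.
  first_fields <- strsplit(readLines(path, n = 1), sep, fixed = TRUE)[[1]]
  n_col <- length(first_fields)

  # Use the header fields as names, or fall back to V1, V2, ... like read.table.
  col_names <- if (header) first_fields else paste0("V", seq_len(n_col))

  # scan() returns one long numeric vector, filled row by row.
  values <- scan(path, what = numeric(), sep = sep,
                 skip = if (header) 1 else 0, quiet = TRUE)
  stopifnot(length(values) %% n_col == 0)

  # Reshape into rows x columns; byrow = TRUE matches the file layout.
  m <- matrix(values, ncol = n_col, byrow = TRUE)
  colnames(m) <- col_names
  as.data.frame(m)
}

# Small usage example on a temporary two-row file.
demo.f <- tempfile()
writeLines(c("1,2,3", "4,5,6"), con = demo.f)
demo <- read_wide_numeric(demo.f)
dim(demo)
```

On the 3700-column test file above this would be called as
read_wide_numeric(x.f). It gives up read.table's per-column type detection,
so it only fits files known to be entirely numeric.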
On Tue, Dec 6, 2011 at 4:33 PM, Gene Leynes <gleynes at gmail.com> wrote:
> Mark,
>
> Thanks for your suggestions.
>
> That's a good idea about the NULL columns; I didn't think of that.
> Surprisingly, it didn't have any effect on the time.
>
> This problem was just a curiosity, I already did the import using Excel and
> VBA. I was just going to illustrate the power and simplicity of R, but
> ironically it's been much slower and harder in R...
> The VBA was painful and messy, and took me over an hour to write; but at
> least it worked quickly and reliably.
> The R code was clean and only took me about 5 minutes to write, but the run
> time was prohibitively slow!
>
> I profiled the code, but that offers little insight to me.
>
> Profile results with 10 line file:
>
>> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
> $by.self
>              self.time self.pct total.time total.pct
> scan             12.24    53.50      12.24     53.50
> read.table       10.58    46.24      22.88    100.00
> type.convert      0.04     0.17       0.04      0.17
> make.names        0.02     0.09       0.02      0.09
>
> $by.total
>              total.time total.pct self.time self.pct
> read.table        22.88    100.00     10.58    46.24
> scan              12.24     53.50     12.24    53.50
> type.convert       0.04      0.17      0.04     0.17
> make.names         0.02      0.09      0.02     0.09
>
> $sample.interval
> [1] 0.02
>
> $sampling.time
> [1] 22.88
>
>
> Profile results with 250 line file:
>
>> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
> $by.self
>              self.time self.pct total.time total.pct
> scan             23.88    68.15      23.88     68.15
> read.table       10.78    30.76      35.04    100.00
> type.convert      0.30     0.86       0.32      0.91
> character         0.02     0.06       0.02      0.06
> file              0.02     0.06       0.02      0.06
> lapply            0.02     0.06       0.02      0.06
> unlist            0.02     0.06       0.02      0.06
>
> $by.total
>                total.time total.pct self.time self.pct
> read.table          35.04    100.00     10.78    30.76
> scan                23.88     68.15     23.88    68.15
> type.convert         0.32      0.91      0.30     0.86
> sapply               0.04      0.11      0.00     0.00
> character            0.02      0.06      0.02     0.06
> file                 0.02      0.06      0.02     0.06
> lapply               0.02      0.06      0.02     0.06
> unlist               0.02      0.06      0.02     0.06
> simplify2array       0.02      0.06      0.00     0.00
>
> $sample.interval
> [1] 0.02
>
> $sampling.time
> [1] 35.04
>
>
>
>
> On Tue, Dec 6, 2011 at 2:34 PM, Mark Leeds <markleeds2 at gmail.com> wrote:
>
>> hi gene: maybe someone else will reply with some subtleties that I'm not
>> aware of. one other thing
>> that might help: if you know which columns you want , you can set the
>> others to NULL through
>> colClasses and this should speed things up also. For example, say you knew
>> you only wanted the
>> first four columns and they were character. then you could do,
>>
>> read.table(whatever, as.is=TRUE, colClasses = c(rep("character",4),
>> rep("NULL",3696)))
>>
>> hopefully someone else will say something that does the trick. it seems
>> odd to me as far as the
>> difference in timings. good luck.
>>
>>
>>
>>
>>
>> On Tue, Dec 6, 2011 at 1:55 PM, Gene Leynes <gleynes at gmail.com> wrote:
>>
>>> Mark,
>>>
>>> Thank you for the reply
>>>
>>> I neglected to mention that I had already set
>>> options(stringsAsFactors=FALSE)
>>>
>>> I agree, skipping the factor determination can help performance.
>>>
>>> The main reason that I wanted to use read.table is because it will
>>> correctly determine the column classes for me. I don't really want to
>>> specify 3700 column classes! (I'm not sure what they are anyway).
>>>
>>>
>>> On Tue, Dec 6, 2011 at 12:40 PM, Mark Leeds <markleeds2 at gmail.com> wrote:
>>>
>>>> Hi Gene: Sometimes using colClasses in read.table can speed things up.
>>>> If you know what your variables are ahead of time and what you want them to
>>>> be, this allows you to be specific by specifying character or numeric,
>>>> etc., and often it makes things faster. others will have more to say.
>>>>
>>>> also, if most of your variables are characters, R will try to
>>>> convert them into factors by default. If you use as.is = TRUE it won't
>>>> do this and that might speed things up also.
>>>>
>>>>
>>>> Rejoinder: above tidbits are just from experience. I don't know if
>>>> it's in stone or a hard and fast rule.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Dec 6, 2011 at 1:15 PM, Gene Leynes <gleynes at gmail.com> wrote:
>>>>
>>>>> ** Disclaimer: I'm looking for general suggestions **
>>>>> I'm sorry, but can't send out the file I'm using, so there is no
>>>>> reproducible example.
>>>>>
>>>>> I'm using read.table and it's taking over 30 seconds to read a tiny file.
>>>>> The strange thing is that it takes roughly the same amount of time if the
>>>>> file is 100 times larger.
>>>>>
>>>>> After re-reviewing the data Import / Export manual I think the best
>>>>> approach would be to use Python, or perhaps the readLines function, but I
>>>>> was hoping to understand why the simple read.table approach wasn't working
>>>>> as expected.
>>>>>
>>>>> Some relevant facts:
>>>>>
>>>>>   1. There are about 3700 columns. Maybe this is the problem? Still the
>>>>>      file size is not very large.
>>>>>   2. The file encoding is ANSI, but I'm not specifying that in the
>>>>>      function. Setting fileEncoding="ANSI" produces an "unsupported
>>>>>      conversion" error
>>>>>   3. readLines imports the lines quickly
>>>>>   4. scan imports the file quickly also
>>>>>
>>>>>
>>>>> Obviously, scan and readLines would require more coding to identify
>>>>> columns, etc.
>>>>>
>>>>> my code:
>>>>> system.time(dat <- read.table('C:/test.txt', nrows=-1, sep='\t',
>>>>> header=TRUE))
>>>>>
>>>>> It's taking 33.4 seconds and the file size is only 315 kb!
>>>>>
>>>>> Thanks
>>>>>
>>>>> Gene
>>>>>
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>
>>>>
>>>
>>
>
--
Jim Holtman
Data Munger Guru
What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.