Message-ID: <CAAmySGNnNS9M6M6YMEOpXu3JCJq=7hr2SQ=CWQtG6siP0u3PkQ@mail.gmail.com>
Date: 2011-12-07T21:37:59Z
From: R. Michael Weylandt
Subject: read.table performance
In-Reply-To: <CAOBARVj5dyN3168NYPffagzGOZYejWS_V=t31fPKFtC454H_RQ@mail.gmail.com>

R 2.13.2 on Mac OS X 10.5.8 takes about 1.8s to read the file
verbatim: system.time(read.table("test2.txt"))

Michael

2011/12/7 Gene Leynes <gleynes at gmail.com>:
> Peter,
>
> You're quite right; it's nearly impossible to make progress without a
> working example.
>
> I created an ** extremely simplified ** example for distribution. The real
> data has numeric, character, and boolean classes.
>
> The file still takes 25.08 seconds to read, despite its small size.
>
> I neglected to mention that I'm using R 2.13.0 and I'm on a Windows 7
> machine (not that it should particularly matter with this type of data /
> functions).
>
> ## The code:
> options(stringsAsFactors=FALSE)
> system.time(dat <- read.table('test2.txt', nrows=-1, sep='\t', header=TRUE))
> str(dat, 0)
>
>
> Thanks again!
>
>
>
> On Wed, Dec 7, 2011 at 1:21 AM, peter dalgaard <pdalgd at gmail.com> wrote:
>
>>
>> On Dec 6, 2011, at 22:33 , Gene Leynes wrote:
>>
>> > Mark,
>> >
>> > Thanks for your suggestions.
>> >
>> > That's a good idea about the NULL columns; I didn't think of that.
>> > Surprisingly, it didn't have any effect on the time.
>>
>> Hmm, I think you want "character" and "NULL" there (i.e., quoted). Did you
>> fix both?
>>
>> >> read.table(whatever, as.is=TRUE, colClasses = c(rep(character,4),
>> >> rep(NULL,3696)).
>>
>> As a general matter, if you want people to dig into this, they need some
>> paraphrase of the file to play with. Would it be possible to set up a small
>> R program that generates a data file which displays the issue? Everything I
>> try seems to take about a second to read in.
>>
>> -pd
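
A minimal sketch of the kind of generator script asked for above, assuming purely numeric columns. The 10 x 3700 shape, the V1..V3700 column names, and the temporary file are stand-ins rather than Gene's actual data, so the timings need not reproduce his. It also shows colClasses with the class names quoted:

```r
# Hypothetical stand-in for the real data: 10 rows by 3700 numeric columns.
n_rows <- 10
n_cols <- 3700
set.seed(1)
dat <- as.data.frame(matrix(round(rnorm(n_rows * n_cols), 3), nrow = n_rows))
names(dat) <- paste("V", seq_len(n_cols), sep = "")

# Write it out as a tab-delimited file with a header row.
path <- tempfile(fileext = ".txt")
write.table(dat, path, sep = "\t", quote = FALSE, row.names = FALSE)

# Baseline: let read.table work out every column class itself.
t_guess <- system.time(
  full <- read.table(path, sep = "\t", header = TRUE, stringsAsFactors = FALSE)
)

# Quoted class names: read the first 4 columns as character, skip the rest.
classes <- c(rep("character", 4), rep("NULL", n_cols - 4))
t_subset <- system.time(
  first4 <- read.table(path, sep = "\t", header = TRUE, colClasses = classes)
)

# Elapsed seconds for each call; values depend on the machine and R version.
print(c(guess = t_guess[["elapsed"]], subset = t_subset[["elapsed"]]))
```

With the unquoted form, rep(character, 4) fails because character is a function, and rep(NULL, 3696) is simply NULL, so the "skip" columns would never reach colClasses.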
>>
>> >
>> > This problem was just a curiosity; I already did the import using Excel
>> > and VBA. I was just going to illustrate the power and simplicity of R,
>> > but ironically it's been much slower and harder in R...
>> > The VBA was painful and messy, and took me over an hour to write; but at
>> > least it worked quickly and reliably.
>> > The R code was clean and only took me about 5 minutes to write, but the
>> > run time was prohibitively slow!
>> >
>> > I profiled the code, but that offers little insight to me.
>> >
>> > Profile results with 10 line file:
>> >
>> >> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
>> > $by.self
>> >              self.time self.pct total.time total.pct
>> > scan             12.24    53.50      12.24     53.50
>> > read.table       10.58    46.24      22.88    100.00
>> > type.convert      0.04     0.17       0.04      0.17
>> > make.names        0.02     0.09       0.02      0.09
>> >
>> > $by.total
>> >              total.time total.pct self.time self.pct
>> > read.table        22.88    100.00     10.58    46.24
>> > scan              12.24     53.50     12.24    53.50
>> > type.convert       0.04      0.17      0.04     0.17
>> > make.names         0.02      0.09      0.02     0.09
>> >
>> > $sample.interval
>> > [1] 0.02
>> >
>> > $sampling.time
>> > [1] 22.88
>> >
>> >
>> > Profile results with 250 line file:
>> >
>> >> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
>> > $by.self
>> >              self.time self.pct total.time total.pct
>> > scan             23.88    68.15      23.88     68.15
>> > read.table       10.78    30.76      35.04    100.00
>> > type.convert      0.30     0.86       0.32      0.91
>> > character         0.02     0.06       0.02      0.06
>> > file              0.02     0.06       0.02      0.06
>> > lapply            0.02     0.06       0.02      0.06
>> > unlist            0.02     0.06       0.02      0.06
>> >
>> > $by.total
>> >                total.time total.pct self.time self.pct
>> > read.table          35.04    100.00     10.78    30.76
>> > scan                23.88     68.15     23.88    68.15
>> > type.convert         0.32      0.91      0.30     0.86
>> > sapply               0.04      0.11      0.00     0.00
>> > character            0.02      0.06      0.02     0.06
>> > file                 0.02      0.06      0.02     0.06
>> > lapply               0.02      0.06      0.02     0.06
>> > unlist               0.02      0.06      0.02     0.06
>> > simplify2array       0.02      0.06      0.00     0.00
>> >
>> > $sample.interval
>> > [1] 0.02
>> >
>> > $sampling.time
>> > [1] 35.04
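
For anyone who wants to collect a profile like the ones quoted here, a rough sketch follows. The input file is a made-up 10 x 2000 table, not the file from this thread, and the read is repeated three times only so the sampler has something to record:

```r
# Stand-in input: a small tab-delimited file with many columns (hypothetical shape).
path <- tempfile(fileext = ".txt")
wide <- as.data.frame(matrix(seq_len(10 * 2000), nrow = 10))
write.table(wide, path, sep = "\t", quote = FALSE, row.names = FALSE)

# Start the sampling profiler, writing samples to prof_file every 0.02 s.
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.02)
for (i in 1:3) {
  dat <- read.table(path, sep = "\t", header = TRUE)
}
Rprof(NULL)  # stop profiling

# summaryRprof() can stop with an error when no samples were recorded
# (e.g. the reads finished too quickly), so guard the call.
prof <- tryCatch(summaryRprof(prof_file), error = function(e) NULL)
if (!is.null(prof)) print(prof$by.self)
```

The by.self table attributes time to the function actually running when a sample was taken, while by.total also counts time spent in anything that function called.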
>> >
>> >
>> >
>> >
>> > On Tue, Dec 6, 2011 at 2:34 PM, Mark Leeds <markleeds2 at gmail.com> wrote:
>> >
>> >> hi gene: maybe someone else will reply with some subtleties that I'm not
>> >> aware of. one other thing that might help: if you know which columns you
>> >> want, you can set the others to NULL through colClasses and this should
>> >> speed things up also. For example, say you knew you only wanted the first
>> >> four columns and they were character. then you could do,
>> >>
>> >> read.table(whatever, as.is=TRUE, colClasses = c(rep(character,4),
>> >> rep(NULL,3696)).
>> >>
>> >> hopefully someone else will say something that does the trick. it seems
>> >> odd to me as far as the difference in timings. good luck.
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Tue, Dec 6, 2011 at 1:55 PM, Gene Leynes <gleynes at gmail.com> wrote:
>> >>
>> >>> Mark,
>> >>>
>> >>> Thank you for the reply
>> >>>
>> >>> I neglected to mention that I had already set
>> >>> options(stringsAsFactors=FALSE)
>> >>>
>> >>> I agree, skipping the factor determination can help performance.
>> >>>
>> >>> The main reason that I wanted to use read.table is because it will
>> >>> correctly determine the column classes for me. I don't really want to
>> >>> specify 3700 column classes! (I'm not sure what they are anyway).
>> >>>
>> >>>
>> >>> On Tue, Dec 6, 2011 at 12:40 PM, Mark Leeds <markleeds2 at gmail.com> wrote:
>> >>>
>> >>>> Hi Gene: Sometimes using colClasses in read.table can speed things up.
>> >>>> If you know what your variables are ahead of time and what you want
>> >>>> them to be, this allows you to be specific by specifying character or
>> >>>> numeric, etc., and often it makes things faster. others will have more
>> >>>> to say.
>> >>>>
>> >>>> also, if most of your variables are characters, R will try to convert
>> >>>> them into factors by default. If you use as.is = TRUE it won't do this
>> >>>> and that might speed things up also.
>> >>>>
>> >>>>
>> >>>> Rejoinder: above tidbits are just from experience. I don't know if
>> >>>> it's set in stone or a hard and fast rule.
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Tue, Dec 6, 2011 at 1:15 PM, Gene Leynes <gleynes at gmail.com> wrote:
>> >>>>
>> >>>>> ** Disclaimer: I'm looking for general suggestions **
>> >>>>> I'm sorry, but can't send out the file I'm using, so there is no
>> >>>>> reproducible example.
>> >>>>>
>> >>>>> I'm using read.table and it's taking over 30 seconds to read a tiny
>> >>>>> file. The strange thing is that it takes roughly the same amount of
>> >>>>> time if the file is 100 times larger.
>> >>>>>
>> >>>>> After re-reviewing the R Data Import/Export manual I think the best
>> >>>>> approach would be to use Python, or perhaps the readLines function,
>> >>>>> but I was hoping to understand why the simple read.table approach
>> >>>>> wasn't working as expected.
>> >>>>>
>> >>>>> Some relevant facts:
>> >>>>>
>> >>>>>  1. There are about 3700 columns. Maybe this is the problem? Still, the
>> >>>>>     file size is not very large.
>> >>>>>  2. The file encoding is ANSI, but I'm not specifying that in the
>> >>>>>     function. Setting fileEncoding="ANSI" produces an "unsupported
>> >>>>>     conversion" error.
>> >>>>>  3. readLines imports the lines quickly.
>> >>>>>  4. scan imports the file quickly also.
>> >>>>>
>> >>>>>
>> >>>>> Obviously, scan and readLines would require more coding to identify
>> >>>>> columns, etc.
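
For what it's worth, the extra coding for the readLines route can be fairly small. A rough sketch, assuming a tab-delimited file with a header row, no quoted fields, and no empty trailing fields; the three-column sample file is made up for illustration:

```r
# Made-up sample file standing in for the real one.
path <- tempfile(fileext = ".txt")
writeLines(c("id\tname\tflag",
             "1\talpha\tTRUE",
             "2\tbeta\tFALSE"), path)

# Read raw lines, then split each one on tabs.
# Note: strsplit() drops an empty field at the very end of a line,
# so rows with a trailing tab would come out one field short.
lines  <- readLines(path)
fields <- strsplit(lines, "\t", fixed = TRUE)

# First line is the header; the rest become a character matrix.
header <- fields[[1]]
body   <- do.call(rbind, fields[-1])

# Build the data frame and let type.convert() pick each column's class,
# keeping strings as character rather than factor.
dat <- as.data.frame(body, stringsAsFactors = FALSE)
names(dat) <- header
dat[] <- lapply(dat, type.convert, as.is = TRUE)
str(dat)
```

This keeps the automatic class detection that made read.table attractive, since type.convert is the same helper that shows up in the profiles.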
>> >>>>>
>> >>>>> my code:
>> >>>>> system.time(dat <- read.table('C:/test.txt', nrows=-1, sep='\t',
>> >>>>> header=TRUE))
>> >>>>>
>> >>>>> It's taking 33.4 seconds and the file size is only 315 kb!
>> >>>>>
>> >>>>> Thanks
>> >>>>>
>> >>>>> Gene
>> >>>>>
>> >>>>>       [[alternative HTML version deleted]]
>> >>>>>
>> >>>>> ______________________________________________
>> >>>>> R-help at r-project.org mailing list
>> >>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>> >>>>> PLEASE do read the posting guide
>> >>>>> http://www.R-project.org/posting-guide.html
>> >>>>> and provide commented, minimal, self-contained, reproducible code.
>> >>>>>
>> >>>>
>> >>>>
>> >>>
>> >>
>> >
>>
>> --
>> Peter Dalgaard, Professor,
>> Center for Statistics, Copenhagen Business School
>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
>> Phone: (+45)38153501
>> Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
>>