read.table performance

Hi
system.time(dat<-read.table("test2.txt"))
user  system elapsed 
  32.38    0.00   32.40
system.time(dat <- read.table('test2.txt', nrows=-1, sep='\t', 
header=TRUE))
   user  system elapsed 
  32.30    0.03   32.36 

Couldn't it be a Windows issue?
               _  
platform       i386-pc-mingw32  
arch           i386  
os             mingw32  
system         i386, mingw32  
status         Under development (unstable)  
major          2  
minor          14.0  
year           2011  
month          04  
day            27  
svn rev        55657  
language       R  
version.string R version 2.14.0 Under development (unstable) (2011-04-27 
r55657)

dim(dat)
[1]    7 3765

But judging from the dat object, the structure of the file seems somewhat odd to me.
head(names(dat))
[1] "X..Hydrogen" "Helium"      "Lithium"     "Beryllium"   "Boron" 
[6] "Carbon"
tail(names(dat))
[1] "Sulfur.32"    "Chlorine.32"  "Argon.32"     "Potassium.32" 
"Calcium.32" 
[6] "Scandium.32"

There is a row of names with repeating values. Maybe most of the time is
spent checking the validity of the names.
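
A minimal sketch of how that hypothesis could be checked, assuming a small made-up file with repeated header names (the temporary file and its contents are hypothetical, not the original test2.txt). By default read.table passes the header through make.names(unique = TRUE), which is where suffixes such as ".32" come from; check.names = FALSE skips that step. Whether this accounts for the slow read is not established in this thread.

```r
# Hypothetical header with repeated names, similar in spirit to the file above.
hdr <- c("Sulfur", "Chlorine", "Sulfur", "Chlorine", "Sulfur")

# What read.table does to the names by default (check.names = TRUE).
fixed <- make.names(hdr, unique = TRUE)
print(fixed)

# Write a tiny tab-delimited file with that header and one data row.
path <- tempfile(fileext = ".txt")
writeLines(c(paste(hdr, collapse = "\t"),
             paste(1:5, collapse = "\t")), path)

# Read it twice: once with name checking, once without.
checked   <- read.table(path, header = TRUE, sep = "\t")
unchecked <- read.table(path, header = TRUE, sep = "\t", check.names = FALSE)

print(names(checked))    # names made unique
print(names(unchecked))  # names kept exactly as in the file
```

Wrapping each read.table call in system.time() on the real file would show whether name checking matters for the timing.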

Regards
Petr

r-help-bounces at r-project.org wrote on 07.12.2011 23:11:10:
From: peter dalgaard <pdalgd at gmail.com>
Sent by: r-help-bounces at r-project.org
Date: 07.12.2011 23:11
To: "R. Michael Weylandt" <michael.weylandt at gmail.com>
Cc: r-help at r-project.org, Gene Leynes <gleynes at gmail.com>
Subject: Re: [R] read.table performance

On Dec 7, 2011, at 22:37 , R. Michael Weylandt wrote:

R 2.13.2 on Mac OS X 10.5.8 takes about 1.8s to read the file
verbatim: system.time(read.table("test2.txt"))
About 2.3s with 2.14 on a 1.86 GHz MacBook Air 10.6.8. 

Gene, are you by any chance storing the file in a heavily virus-scanned 
system directory?

-pd

Michael

2011/12/7 Gene Leynes <gleynes at gmail.com>:
Peter,

You're quite right; it's nearly impossible to make progress without a
working example.

I created an ** extremely simplified ** example for distribution. The real
data has numeric, character, and boolean classes.

The file still takes 25.08 seconds to read, despite its small size.

I neglected to mention that I'm using R 2.13.0 and I'm on a Windows 7
machine (not that it should particularly matter with this type of data /
functions).

## The code:
options(stringsAsFactors=FALSE)
system.time(dat <- read.table('test2.txt', nrows=-1, sep='\t', 
header=TRUE))
str(dat, 0)

Thanks again!

On Wed, Dec 7, 2011 at 1:21 AM, peter dalgaard <pdalgd at gmail.com> 
wrote:

On Dec 6, 2011, at 22:33 , Gene Leynes wrote:

Mark,

Thanks for your suggestions.

That's a good idea about the NULL columns; I didn't think of that.
Surprisingly, it didn't have any effect on the time.
Hmm, I think you want "character" and "NULL" there (i.e., quoted). Did you
fix both?

read.table(whatever, as.is=TRUE, colClasses = c(rep(character,4),
rep(NULL,3696)).
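
For illustration, a runnable sketch of the quoted form, using a small made-up four-column file (the file, its columns, and the counts 2 and 2 are hypothetical stand-ins for 4 and 3696). Unquoted, rep(character, 4) fails because character is a function, and rep(NULL, 3696) simply returns NULL, so the class names have to be passed as strings.

```r
# Hypothetical data: two character columns to keep, two numeric columns to skip.
path <- tempfile(fileext = ".txt")
df <- data.frame(a = c("x", "y"), b = c("u", "v"), c = 1:2, d = 3:4)
write.table(df, path, sep = "\t", quote = FALSE, row.names = FALSE)

# Class names must be quoted; "NULL" (as a string) tells read.table to skip a column.
cls <- c(rep("character", 2), rep("NULL", 2))

dat <- read.table(path, header = TRUE, sep = "\t",
                  as.is = TRUE, colClasses = cls)
str(dat)

# The unquoted NULL version collapses to nothing at all.
print(is.null(rep(NULL, 2)))
```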
As a general matter, if you want people to dig into this, they need some
paraphrase of the file to play with. Would it be possible to set up a small
R program that generates a data file which displays the issue? Everything I
try seems to take about a second to read in.

-pd
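
A sketch of the kind of generator script being asked for, assuming a tab-delimited file with 7 rows, 3700 numeric columns, and a header that repeats a few element names. All of these choices (file name, dimensions, names, values) are assumptions made for illustration; this does not reproduce the original data, and it may not reproduce the slowdown.

```r
set.seed(1)

# Assumed shape: "about 3700 columns" and a handful of rows.
n_cols <- 3700
n_rows <- 7

# Header built by cycling through a few element names, so names repeat.
elements <- c("Hydrogen", "Helium", "Lithium", "Beryllium", "Boron", "Carbon")
header <- rep(elements, length.out = n_cols)

# Random numeric body, one row of the matrix per line of the file.
values <- matrix(round(runif(n_rows * n_cols), 3), nrow = n_rows)
body_lines <- apply(values, 1, paste, collapse = "\t")

path <- file.path(tempdir(), "wide_test.txt")
writeLines(c(paste(header, collapse = "\t"), body_lines), path)

# Time the same kind of call used in the thread.
timing <- system.time(dat <- read.table(path, header = TRUE, sep = "\t"))
print(timing)
print(dim(dat))
```

Changing n_cols, the column types, or the header contents one at a time is a simple way to narrow down which property of the file triggers the slow read.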

This problem was just a curiosity, I already did the import using Excel and
VBA.  I was just going to illustrate the power and simplicity of R, but
ironically it's been much slower and harder in R...
The VBA was painful and messy, and took me over an hour to write; but at
least it worked quickly and reliably.
The R code was clean and only took me about 5 minutes to write, but the run
time was prohibitively slow!

I profiled the code, but that offers little insight to me.

Profile results with 10 line file:

summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
$by.self
            self.time self.pct total.time total.pct
scan             12.24    53.50      12.24     53.50
read.table       10.58    46.24      22.88    100.00
type.convert      0.04     0.17       0.04      0.17
make.names        0.02     0.09       0.02      0.09

$by.total
            total.time total.pct self.time self.pct
read.table        22.88    100.00     10.58    46.24
scan              12.24     53.50     12.24    53.50
type.convert       0.04      0.17      0.04     0.17
make.names         0.02      0.09      0.02     0.09

$sample.interval
[1] 0.02

$sampling.time
[1] 22.88

Profile results with 250 line file:

summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
$by.self
            self.time self.pct total.time total.pct
scan             23.88    68.15      23.88     68.15
read.table       10.78    30.76      35.04    100.00
type.convert      0.30     0.86       0.32      0.91
character         0.02     0.06       0.02      0.06
file              0.02     0.06       0.02      0.06
lapply            0.02     0.06       0.02      0.06
unlist            0.02     0.06       0.02      0.06

$by.total
              total.time total.pct self.time self.pct
read.table          35.04    100.00     10.78    30.76
scan                23.88     68.15     23.88    68.15
type.convert         0.32      0.91      0.30     0.86
sapply               0.04      0.11      0.00     0.00
character            0.02      0.06      0.02     0.06
file                 0.02      0.06      0.02     0.06
lapply               0.02      0.06      0.02     0.06
unlist               0.02      0.06      0.02     0.06
simplify2array       0.02      0.06      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 35.04
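
For reference, a sketch of how a profile summary of this form can be produced with Rprof and summaryRprof. The input file and the one-second loop are assumptions made so the example collects enough samples on a fast machine; they are not the commands used for the results above.

```r
# Hypothetical input: a small tab-delimited file with 500 numeric columns.
path <- tempfile(fileext = ".txt")
values <- matrix(round(runif(7 * 500), 3), nrow = 7)
write.table(values, path, sep = "\t", quote = FALSE, row.names = FALSE)

prof_file <- tempfile(fileext = ".out")

# Start sampling every 0.02 seconds (the interval shown in the output above).
Rprof(prof_file, interval = 0.02)

# Repeat the read for about one second so the profiler has something to record.
start <- proc.time()[["elapsed"]]
while (proc.time()[["elapsed"]] - start < 1) {
  dat <- read.table(path, header = TRUE, sep = "\t")
}

# Stop profiling and summarise.
Rprof(NULL)
prof <- summaryRprof(prof_file)
print(names(prof))
print(head(prof$by.self))
```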

On Tue, Dec 6, 2011 at 2:34 PM, Mark Leeds <markleeds2 at gmail.com> 
wrote:

hi gene: maybe someone else will reply with some subtleties that I'm not
aware of. one other thing that might help: if you know which columns you
want, you can set the others to NULL through colClasses and this should
speed things up also. For example, say you knew you only wanted the first
four columns and they were character. then you could do,

read.table(whatever, as.is=TRUE, colClasses = c(rep(character,4),
rep(NULL,3696)).

hopefully someone else will say something that does the trick. it seems
odd to me as far as the difference in timings? good luck.

On Tue, Dec 6, 2011 at 1:55 PM, Gene Leynes <gleynes at gmail.com> 
wrote:

Mark,

Thank you for the reply

I neglected to mention that I had already set
options(stringsAsFactors=FALSE)

I agree, skipping the factor determination can help performance.

The main reason that I wanted to use read.table is because it will
correctly determine the column classes for me.  I don't really want to
specify 3700 column classes!  (I'm not sure what they are anyway).
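
One common way around typing out the classes by hand is to let read.table guess them from the first few rows and then reuse that guess for the full read. A minimal sketch, assuming a made-up four-column file and a sample of 5 rows (both hypothetical); the guess can be wrong if later rows differ in type from the sampled ones, for example an integer-looking column that later contains decimals.

```r
# Hypothetical file with mixed column types.
path <- tempfile(fileext = ".txt")
full <- data.frame(id    = 1:20,
                   score = seq(0.5, 10, by = 0.5),
                   label = rep(c("a", "b"), 10),
                   flag  = rep(c(TRUE, FALSE), 10),
                   stringsAsFactors = FALSE)
write.table(full, path, sep = "\t", quote = FALSE, row.names = FALSE)

# Step 1: read a few rows and let read.table determine the classes.
sample_rows <- read.table(path, header = TRUE, sep = "\t",
                          nrows = 5, stringsAsFactors = FALSE)
classes <- vapply(sample_rows, function(col) class(col)[1], character(1))
print(classes)

# Step 2: read the whole file with those classes supplied up front.
dat <- read.table(path, header = TRUE, sep = "\t",
                  colClasses = unname(classes))
str(dat)
```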

On Tue, Dec 6, 2011 at 12:40 PM, Mark Leeds 
<markleeds2 at gmail.com>
wrote:

Hi Gene: Sometimes using colClasses in read.table can speed things up.
If you know what your variables are ahead of time and what you want them to
be, this allows you to be specific by specifying character or numeric,
etc., and often it makes things faster. others will have more to say.
also, if most of your variables are characters, R will try to convert them
into factors by default. If you use as.is = TRUE it won't do this and that
might speed things up also.

Rejoinder: the above tidbits are just from experience. I don't know if
it's in stone or a hard and fast rule.

On Tue, Dec 6, 2011 at 1:15 PM, Gene Leynes <gleynes at gmail.com>
wrote:

** Disclaimer: I'm looking for general suggestions **
I'm sorry, but can't send out the file I'm using, so there is no
reproducible example.

I'm using read.table and it's taking over 30 seconds to read a tiny file.
The strange thing is that it takes roughly the same amount of time if the
file is 100 times larger.

After re-reviewing the data Import / Export manual I think the best
approach would be to use Python, or perhaps the readLines function, but I
was hoping to understand why the simple read.table approach wasn't working
as expected.

Some relevant facts:

 1. There are about 3700 columns.  Maybe this is the problem? Still, the
    file size is not very large.
 2. The file encoding is ANSI, but I'm not specifying that in the
    function.  Setting fileEncoding="ANSI" produces an "unsupported
    conversion" error
 3. readLines imports the lines quickly
 4. scan imports the file quickly also

Obviously, scan and readLines would require more coding to identify
columns, etc.
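
A rough sketch of what that extra coding might look like with readLines, assuming a small made-up tab-delimited file with a header row and no quoted fields (the file and column names are hypothetical). Note that strsplit drops a trailing empty field, so lines ending in a tab would need special handling.

```r
# Hypothetical tab-delimited input.
path <- tempfile(fileext = ".txt")
writeLines(c("id\tname\tvalue",
             "1\ta\t2.5",
             "2\tb\t3.5"), path)

# Read raw lines and split each one on tabs.
lines  <- readLines(path)
fields <- strsplit(lines, "\t", fixed = TRUE)

# First line is the header; the rest become a character matrix.
header <- fields[[1]]
body   <- do.call(rbind, fields[-1])

# Build a data frame of strings, then convert each column to its natural type.
dat <- as.data.frame(body, stringsAsFactors = FALSE)
names(dat) <- header
dat[] <- lapply(dat, type.convert, as.is = TRUE)
str(dat)
```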

my code:
system.time(dat <- read.table('C:/test.txt', nrows=-1, sep='\t',
                              header=TRUE))

It's taking 33.4 seconds and the file size is only 315 kb!

Thanks

Gene

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible 
code.

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
