Creating a custom connection to read from multiple files
4 messages · Brian Ripley, Tomas Kalibera
On Thu, 20 Jan 2005, Tomas Kalibera wrote:
is it possible to create my own connection which I could use with
Yes. In a sense, all the connections are custom connections written by someone.
read.table or scan ? I would like to create a connection that would read from multiple files in sequence (like if they were concatenated), possibly with an option to skip first n lines of each file. I would like to avoid using platform specific scripts for that... (currently I invoke "/bin/cat" from R to create a concatenation of all those files).
I would use pipes, but a pure R solution is to process the files to an anonymous file() connection and then read that. However, what is wrong with reading a file at a time and combining the results in R using rbind?
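One way to realize the "process the files into one connection" suggestion in pure R is to read each file's lines, drop the first n lines of each, and feed the combined text to read.table through a single text connection. A minimal sketch (the function name read_concat and its arguments are illustrative, not from the thread):

```r
## Sketch of the "pure R" approach: concatenate the files' lines in R,
## skipping the first `skip` lines of each, then read once via a
## textConnection instead of shelling out to /bin/cat.
read_concat <- function(files, skip = 0, ...) {
  all_lines <- unlist(lapply(files, function(f) {
    lines <- readLines(f)
    if (skip > 0)
      lines <- lines[-seq_len(min(skip, length(lines)))]
    lines
  }))
  con <- textConnection(all_lines)
  on.exit(close(con))
  read.table(con, ...)
}
```

This avoids any platform-specific external command, at the cost of holding all the lines in memory once before parsing.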
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
Dear Prof Ripley, thanks for your suggestions; it's very nice that one can create custom connections directly in R, and I think that is what I need just now.
However, what is wrong with reading a file at a time and combining the results in R using rbind?
Well, the problem is performance. If I concatenate all those files, they total around 8 MB and may grow to tens of MB in the near future. Both concatenating and reading the single file with scan take about 5 seconds (which is almost OK). However, reading the individual files with read.table and rbinding them one by one (samples <- rbind(samples, newSamples)) takes minutes. The same happens when I concatenate lists manually; scan does not help significantly. I guess there is some overhead in detecting object dimensions in rbind, or in re-allocating and copying the data? Best regards, Tomas Kalibera
On Thu, 20 Jan 2005, Tomas Kalibera wrote:
However, what is wrong with reading a file at a time and combining the results in R using rbind?
Well, the problem is performance. If I concatenate all those files, they total around 8 MB and may grow to tens of MB in the near future. Both concatenating and reading the single file with scan take about 5 seconds (which is almost OK). However, reading the individual files with read.table and rbinding them one by one (samples <- rbind(samples, newSamples)) takes minutes. The same happens when I concatenate lists manually; scan does not help significantly. I guess there is some overhead in detecting object dimensions in rbind, or in re-allocating and copying the data?
rbind is vectorized so you are using it (way) suboptimally.
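Concretely, the vectorized pattern is to read every file into its own data frame first and then combine them with a single rbind call; read.table's own skip argument drops the first n lines of each file. A short sketch (the wrapper read_all is illustrative):

```r
## Vectorized use of rbind, per the advice above: one read.table per
## file, then a single rbind over the whole list, instead of growing
## `samples` inside a loop.
read_all <- function(files, n = 0) {
  pieces <- lapply(files, read.table, skip = n)  # one data frame per file
  do.call(rbind, pieces)                         # one vectorized rbind
}
```

The loop version copies the accumulated result on every iteration, so its cost grows roughly quadratically with the number of rows; the single do.call(rbind, ...) copies each row only once.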