
Huge data frames?

4 messages · Magnus Lie Hetland, Ott Toomet, Brian Ripley, and one other

#
A friend of mine recently mentioned that he had painlessly imported a
data file with 8 columns and 500,000 rows into matlab. When I tried
the same thing in R (both Unix and Windows variants) I had little
success. The Windows version hung for a very long time, until I
eventually more or less ran out of virtual memory; I tried to set the
proper memory allocations for the Unix version, but it never seemed
satisfied :]

I used read.table -- should I have used something else? Is it even
possible to work with files this large? I assume a memory-mapped
binary file would have been quite efficient (as opposed to an
in-memory parsed text file) -- is something like that even possible in
R?
#
Hi,

You should use scan() to read large ASCII tables.  If you save a data frame
using save(), you get a binary file that loads much faster than re-parsing text.
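A minimal sketch of both steps, assuming a whitespace-separated file of
8 numeric columns with no header (the file and column names are hypothetical):

cols <- rep(list(0), 8)                   # one numeric template per column
names(cols) <- paste("V", 1:8, sep = "")
x <- scan("big.dat", what = cols)         # typed read, no per-column guessing
df <- as.data.frame(x)

save(df, file = "big.RData")              # binary image of the data frame
load("big.RData")                         # later sessions restore 'df' quickly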

Note that similar problems arise if you try to save big dataframes in ASCII
(you may consider my package savetable at
http://www.obs.ee/~siim/savetable_0.1.0.tar.gz for that).

Best wishes,

Ott
#
On Wed, 28 Aug 2002, Magnus Lie Hetland wrote:

That's not big: if numeric it is a 32Mb object (500,000 rows x 8 columns
x 8 bytes per double).  People do do that quite often (on machines with
512Mb or more, but memory is cheap).  So it is hard to know what the
problem is, but ?read.table gives some hints (including using scan()).

I've just done an experiment. I generated 4 million rnorm() values, made
a matrix, and wrote them out (a sketch of that setup is below).
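A minimal sketch of that setup (the write options are my assumptions):

m <- matrix(rnorm(4e6), ncol = 8)             # 500,000 x 8 doubles
write.table(m, "foo.dat", row.names = FALSE)  # writes a header row V1..V8

Then: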

AA <- read.table("foo.dat", nrows = 5e5,         # pre-allocate the result
                 comment.char = "",              # don't scan for comments
                 colClasses = rep("numeric", 8), # skip type guessing
                 header = TRUE)

worked for me in about 20 seconds, using less than 150Mb.

That was painless, and all the speed-ups are documented in ?read.table.
It is certainly possible to read binary files: that is what load/save do,
and see ?readBin for binary files written by other programs (a sketch
follows).  Having a file that size in memory is not a problem.  Doing
useful analyses may be (especially in Matlab).
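For illustration, a hedged readBin() sketch, assuming a raw file of
4 million doubles in native byte order (the file name is hypothetical):

con <- file("foo.bin", "rb")                 # open in binary mode
v <- readBin(con, what = "double", n = 4e6)  # read 4,000,000 doubles
close(con)
m <- matrix(v, ncol = 8)                     # back to 500,000 x 8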
#
On Wed, Aug 28, 2002 at 05:38:50AM +0200, Magnus Lie Hetland wrote:
Try 'scan'. And have a look at the package 'rhdf5' (at least at
www.bioconductor.org; not sure it's on CRAN). It is not exactly what
you describe, but it could be relevant.
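For what it's worth, a hedged sketch against the present-day Bioconductor
rhdf5 interface (h5createFile/h5write/h5read); the 2002 package's API may
have differed, and the file and dataset names are assumptions:

library(rhdf5)
h5createFile("big.h5")
h5write(m, "big.h5", "m")    # store a large matrix in an HDF5 file
m2 <- h5read("big.h5", "m")  # read it back; index= can read subsets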



Regards,


L.