I have very large csv files (up to 1GB each of ASCII text).
I'd like to be able to read them directly into R. The
problem I am having is with the variable length of the data
in each record.
Here's a (simplified) example:
$ cat foo.csv
Name,Start Month,Data
Foo,10,-0.5615,2.3065,0.1589,-0.3649,1.5955
Bar,21,0.0880,0.5733,0.0081,2.0253,-0.7602,0.7765,0.2810,1.8546,0.2696,0.3316,0.1565,-0.4847,-0.1325,0.0454,-1.2114
The records consist of rows with some set comma-separated
fields (e.g. the "Name" & "Start Month" fields in the above)
and then the data follow as a variable-length list of
comma-separated values until a new line is encountered.
Now I can use e.g.
fileName <- "foo.csv"
ta <- read.csv(fileName, header=FALSE, skip=1, sep=",", dec=".", fill=TRUE)
which does the job nicely:
   V1 V2      V3     V4     V5      V6      V7     V8    V9
1 Foo 10 -0.5615 2.3065 0.1589 -0.3649  1.5955     NA    NA
2 Bar 21  0.0880 0.5733 0.0081  2.0253 -0.7602 0.7765 0.281
     V10    V11    V12    V13     V14     V15    V16     V17
1     NA     NA     NA     NA      NA      NA     NA      NA
2 1.8546 0.2696 0.3316 0.1565 -0.4847 -0.1325 0.0454 -1.2114
but the problem is that with files on the order of 1GB this
either crunches forever or runs out of memory trying. Plus,
having all those NAs isn't too pretty to look at.
(I have a MATLAB version that can read this stuff into an
array of cells in about 3 minutes).
I really want a fast way to read the data part into a list;
that way I can access data in the array of lists containing
the records by doing something like ta[[i]]$data.
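To make the target structure concrete, here is a minimal sketch of the kind of result I am after. It is only an illustration, not a tested solution for 1GB files: the names parseRecord, readRecords and chunkSize are hypothetical, and it assumes no field contains a quoted comma. It reads the file in chunks through a connection so that the whole text never has to sit in memory as one character vector.

```r
# Hypothetical helper: split one line into the fixed fields
# (name, start month) and the variable-length numeric data.
parseRecord <- function(line) {
  fields <- strsplit(line, ",", fixed = TRUE)[[1]]
  list(name       = fields[1],
       startMonth = as.integer(fields[2]),
       data       = as.numeric(fields[-(1:2)]))
}

# Hypothetical reader: skips the header, then parses the file
# chunkSize lines at a time and returns a list of records.
readRecords <- function(fileName, chunkSize = 10000L) {
  con <- file(fileName, open = "r")
  on.exit(close(con))
  readLines(con, n = 1L)                 # discard the header line
  records <- list()
  repeat {
    lines <- readLines(con, n = chunkSize)
    if (length(lines) == 0L) break       # end of file reached
    lines <- lines[nzchar(lines)]        # ignore blank lines
    records <- c(records, lapply(lines, parseRecord))
  }
  records
}

# Small demonstration using the example data from above.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Name,Start Month,Data",
  "Foo,10,-0.5615,2.3065,0.1589,-0.3649,1.5955",
  paste0("Bar,21,0.0880,0.5733,0.0081,2.0253,-0.7602,0.7765,0.2810,1.8546,",
         "0.2696,0.3316,0.1565,-0.4847,-0.1325,0.0454,-1.2114")
), tmp)
ta <- readRecords(tmp)
```

With this, ta[[2]]$data would hold the 15 values of the "Bar" record and nothing is padded with NA. Whether this approach is fast enough on the real files is exactly what I don't know.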
Ideas?
Thanks,
Jack.