reading in data with variable length
3 messages · John McHenry, (Ted Harding)
On 06-Dec-05 John McHenry wrote:
I have very large csv files (up to 1GB each of ASCII text). I'd like to be able to read them directly into R. The problem I am having is with the variable length of the data in each record. Here's a (simplified) example:

$ cat foo.csv
Name,Start Month,Data
Foo,10,-0.5615,2.3065,0.1589,-0.3649,1.5955
Bar,21,0.0880,0.5733,0.0081,2.0253,-0.7602,0.7765,0.2810,1.8546,0.2696,0.3316,0.1565,-0.4847,-0.1325,0.0454,-1.2114

The records consist of rows with some set comma-separated fields (e.g. the "Name" & "Start Month" fields in the above), and then the data follow as a variable-length list of comma-separated values until a newline is encountered.
While you may well get a good R solution from the experts,
in such a situation (as in so many) I would be tempted to
pre-process the file with 'awk' (installed by default on
Unix/Linux systems, available also for Windows).
The following will give you a CSV file with a constant number
of fields per line. While this does not eliminate the NAs which
you apparently find unsightly, it should be a fast and clean way
of doing the basic job, since it is a line-by-line operation in
two passes, so there should be no question of choking the
system (unless you run out of HD space as a result of creating
the second file).
Two passes, on the lines of
Pass 1:
cat foo.csv | awk '
BEGIN{FS=","; n=0}
{m=NF; if(m>n){n=m}}
END{print n} '
which gives you the maximum number of fields in any line.
Suppose (for example) that this number is 37.
Then Pass 2:
cat foo.csv | awk -v maxF=37 '
BEGIN{FS=","; OFS=","}
{if(NF<maxF){$maxF=""}}
{print $0} ' > newfoo.csv
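The trick that makes Pass 2 work is an awk detail worth making explicit: assigning to a field beyond NF (here $maxF) makes awk extend the record with empty fields and rebuild $0 using OFS. A minimal demonstration of just that behaviour:

```shell
# Assigning $5="" to a 2-field record pads it out to 5 comma-separated
# fields, because awk grows the record and rejoins it with OFS.
echo 'a,b' | awk 'BEGIN{FS=OFS=","} {$5=""; print}'
# prints: a,b,,,
```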
Tiny example:
1) See foo.csv
cat foo.csv
1
1,2
1,2,3
1,2,3,4
1,2
2) Pass 1:
cat foo.csv | awk '
BEGIN{FS=","; n=0}
{m=NF; if(m>n){n=m}}
END{print n} '
4
3) So we need 4 fields per line. With maxF=4, Pass 2:
cat foo.csv | awk -v maxF=4 '
BEGIN{FS=","; OFS=","}
{if(NF<maxF){$maxF=""}}
{print $0} ' > newfoo.csv
4) See newfoo.csv
cat newfoo.csv
1,,,
1,2,,
1,2,3,
1,2,3,4
1,2,,
So you now have a CSV file with a constant number of fields per line.
This doesn't make it into lists, though.
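The two passes above can also be chained in a single small script, capturing the Pass 1 result in a shell variable instead of reading it off by hand. A sketch (the file names foo.csv and newfoo.csv are from the example; the rest is glue):

```shell
#!/bin/sh
# Recreate the tiny example file from above.
printf '1\n1,2\n1,2,3\n1,2,3,4\n1,2\n' > foo.csv

# Pass 1: find the maximum number of fields on any line.
maxF=$(awk 'BEGIN{FS=","} NF>n{n=NF} END{print n}' foo.csv)

# Pass 2: pad every shorter line out to maxF fields.
awk -v maxF="$maxF" 'BEGIN{FS=OFS=","} NF<maxF{$maxF=""} {print}' \
    foo.csv > newfoo.csv

cat newfoo.csv
```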
Hoping this helps,
Ted.
Now I can use e.g.
fileName="foo.csv"
ta<-read.csv(fileName, header=F, skip=1, sep=",", dec=".", fill=T)
which does the job nicely:
   V1 V2      V3     V4     V5      V6      V7     V8     V9    V10    V11    V12    V13     V14     V15    V16     V17
1 Foo 10 -0.5615 2.3065 0.1589 -0.3649  1.5955     NA     NA     NA     NA     NA     NA      NA      NA     NA      NA
2 Bar 21  0.0880 0.5733 0.0081  2.0253 -0.7602 0.7765  0.281 1.8546 0.2696 0.3316 0.1565 -0.4847 -0.1325 0.0454 -1.2114
but the problem is with files on the order of 1GB this either
crunches for ever or runs out of memory trying ... plus having all
those NAs isn't too pretty to look at.
(I have a MATLAB version that can read this stuff into an array of
cells in about 3 minutes).
I really want a fast way to read the data part into a list; that way
I can access data in the array of lists containing the records by doing
something like ta[[i]]$data.
Ideas?
Thanks,
Jack.
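As an aside (not from the thread): one way to avoid both the padding and the unsightly NAs entirely is to reshape the ragged rows into a "long" table with a fixed number of columns (Name, Start Month, position, value), one data value per line, which any CSV reader can then ingest and split back into per-record lists. A hedged sketch, assuming the foo.csv layout from the question:

```shell
#!/bin/sh
# Recreate a small version of the example file from the question.
cat > foo.csv <<'EOF'
Name,Start Month,Data
Foo,10,-0.5615,2.3065,0.1589,-0.3649,1.5955
Bar,21,0.0880,0.5733
EOF

# Skip the header (NR>1); for each record, emit one line per data value:
# name, start month, 1-based position within the record, value.
awk 'BEGIN{FS=OFS=","}
     NR>1 {for (i = 3; i <= NF; i++) print $1, $2, i-2, $i}' \
    foo.csv > long.csv

cat long.csv
```

In R this long format could then be split on the name column to recover per-record vectors, with no padding columns at all.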
______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 06-Dec-05  Time: 18:08:54
------------------------------ XFMail ------------------------------