reading in data with variable length

3 messages · John McHenry, (Ted Harding)

On 06-Dec-05 John McHenry wrote:
While you may well get a good R solution from the experts,
in such a situation (as in so many) I would be tempted to
pre-process the file with 'awk' (installed by default on
Unix/Linux systems, available also for Windows).

The following will give you a CSV file with a constant number
of fields per line. While this does not eliminate the NAs which
you apparently find unsightly, it should be a fast and clean way
of doing the basic job, since it is a line-by-line operation in
two passes, so there should be no question of choking the
system (unless you run out of HD space as a result of creating
the second file).

Two passes, on the lines of
Pass 1:

  cat foo.csv | awk '
    BEGIN{FS=","; n=0}
    {m=NF; if(m>n){n=m}}
    END{print n} '

which gives you the maximum number of fields in any line.
Suppose (for example) that this number is 37.
Then Pass 2:

  cat foo.csv | awk -v maxF=37 '
    BEGIN{FS=","; OFS=","}
    {if(NF<maxF){$maxF=""}}
    {print $0} ' > newfoo.csv
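
If a single command is preferred, the two passes can also be
combined by naming the same file twice on the awk command line, so
awk simply reads it twice (a sketch of the same idea, not a change
of method):

```shell
# Recreate the tiny example file used below
printf '1\n1,2\n1,2,3\n1,2,3,4\n1,2\n' > foo.csv

# First pass (NR==FNR): record the maximum field count in n.
# Second pass: pad any shorter record out to n fields and print it.
awk 'BEGIN{FS=OFS=","}
     NR==FNR {if(NF>n) n=NF; next}
     {if(NF<n) $n=""; print}' foo.csv foo.csv > newfoo.csv
```

Assigning to $n forces awk to rebuild the record with OFS, which is
what supplies the trailing commas.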


Tiny example:
1) See foo.csv

  cat foo.csv 
  1
  1,2
  1,2,3
  1,2,3,4
  1,2

2) Pass 1:

  cat foo.csv | awk '
     BEGIN{FS=","; n=0}
     {m=NF; if(m>n){n=m}}
     END{print n} '
  4

3) So we need 4 fields per line. With maxF=4, Pass 2:

  cat foo.csv | awk -v maxF=4 '
     BEGIN{FS=","; OFS=","}
     {if(NF<maxF){$maxF=""}}
     {print $0} ' > newfoo.csv

4) See newfoo.csv

  cat newfoo.csv
  1,,,
  1,2,,
  1,2,3,
  1,2,3,4
  1,2,,

So you now have a CSV file with a constant number of fields per line.
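
As a quick sanity check, you can confirm that every record really
does have the same field count now (sketched here against the tiny
example file above):

```shell
# Recreate the padded file from the example above
printf '1,,,\n1,2,,\n1,2,3,\n1,2,3,4\n1,2,,\n' > newfoo.csv

# Print the distinct field counts; a single line of output means
# every record has the same number of fields.
awk 'BEGIN{FS=","} {print NF}' newfoo.csv | sort -u
```

For the example this prints just "4".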

This doesn't make it into lists, though.

Hoping this helps,
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 06-Dec-05                                       Time: 18:08:54
------------------------------ XFMail ------------------------------