strange behavior when reading csv - line wraps

After some private correspondence with Martin Tomko, I think we have
found the reason for the problem.

The numbers of ";"-separated fields in the 82 successive lines of
his file are as follows:

  01:26   02:26   03:33   04:33   05:12   06:12   07:12   08:12
  09:19   10:19   11:17   12:17   13:23   14:23   15:23   16:23
  17:23   18:23   19:23   20:23   21:23   22:23   23:23   24:23
  25:23   26:23   27:23   28:23   29:23   30:23   31:23   32:23
  33:23   34:23   35:23   36:23   37:23   38:23   39:23   40:23
  41:23   42:23   43:23   44:23   45:23   46:23   47:23   48:23
  49:23   50:23   51:23   52:23   53:23   54:23   55:23   56:23
  57:23   58:23   59:23   60:23   61:34   62:34   63:34   64:34
  65:13   66:13   67:38   68:38   69:20   70:20   71:44   72:20
  73:19   74:19   75:20   76:44   77:20   78:19   79:19   80:20
  81:25   82:25

So in the first 5 lines there is a maximum of 33 fields. Hence, since
there is no header line, read.csv() decides to allocate 33 columns.
(See ?read.csv).

There are the following distinct numbers of fields in the lines:

  12 13 17 19 20 23 25 26 33 34 38 44

so there are lines with 34, 38 and 44 fields. All lines in the CSV
file end with ";", hence there is an implicit blank field at the
end of each line. The lines with 34 fields have the 34th field blank,
so after the break there is presumably a "quasi blank input line"
where the 34th (blank) field has spilled over. Such input will be
ignored with the default "blank.lines.skip = TRUE" option to read.csv().
The longer lines (2 with 38 fields, 2 with 44) will be split after
the 33rd field, the remainder being taken as an additional input
line. As a result, there are 86 (= 82+4) rows in the resulting
dataframe.
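
The row arithmetic above can be checked with a few lines of R. This is
only a sketch of the explanation, not of read.csv() internals: the
field counts are typed in from the table above, and the rule "a
remainder consisting of a single blank field is skipped" is the
assumption made in the preceding paragraph.

```r
# Field counts per line, transcribed from the table above (82 lines).
n_fields <- c(26, 26, 33, 33, 12, 12, 12, 12, 19, 19, 17, 17,
              rep(23, 48),
              34, 34, 34, 34, 13, 13, 38, 38, 20, 20, 44, 20,
              19, 19, 20, 44, 20, 19, 19, 20, 25, 25)

# Column count inferred from the first five lines only.
n_cols <- max(n_fields[1:5])

# Lines whose overflow is more than one (blank) field yield an extra row;
# the 34-field lines spill only a blank field, assumed to be skipped.
extra_rows <- sum(n_fields > n_cols + 1)

total_rows <- length(n_fields) + extra_rows
total_rows   # 86
```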

This explanation is compatible with what Martin has observed.
The underlying forensic details were sniffed out with a couple
of passes through 'awk' scripts.
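
For anyone who prefers to stay inside R, a similar tally can probably
be obtained with count.fields(). The sketch below uses a small
made-up file (not Martin's data) written to a temporary path, so the
file contents and variable names are purely illustrative.

```r
# Hypothetical stand-in for the real file: six ";"-separated lines,
# the last of which is longer than any of the first five.
tmp <- tempfile(fileext = ".csv")
writeLines(c("a;b;c", "d;e", "f;g;h", "i;j", "k;l;m", "n;o;p;q;r"), tmp)

# Number of fields on each line.
n_fields <- count.fields(tmp, sep = ";")

# Maximum over the first five lines versus the whole file.
first5_max  <- max(head(n_fields, 5))
overall_max <- max(n_fields)

# Distinct field counts, and the lines longer than the first-five maximum.
distinct_counts <- sort(unique(n_fields))
long_lines      <- which(n_fields > first5_max)
```

If overall_max exceeds first5_max, the lines listed in long_lines are
the candidates for being wrapped.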

One solution is to call read.csv() with option "col.names=Xnn"
where Xnn is a constructed character vector with elements such
as "X01" "X02" ... "X44" (once one has determined, as above, that
there is a maximum of 44 fields per line in the file).
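
A minimal sketch of that approach, again on a small made-up file
rather than the real one. It assumes there is no header line (hence
header = FALSE) and spells out fill = TRUE, which is already the
default for read.csv(); the name Xnn is taken from the text above,
everything else is illustrative.

```r
# Hypothetical test file: the sixth line has more fields than the first five.
tmp <- tempfile(fileext = ".csv")
writeLines(c("a;b;c", "d;e", "f;g;h", "i;j", "k;l;m", "n;o;p;q;r"), tmp)

# Determine the true maximum number of fields over the whole file.
n_max <- max(count.fields(tmp, sep = ";"))

# Construct names "X01", "X02", ... up to that maximum.
Xnn <- sprintf("X%02d", seq_len(n_max))

# Supplying col.names makes read.csv() allocate n_max columns,
# so long lines are no longer wrapped onto extra rows.
dat <- read.csv(tmp, sep = ";", header = FALSE,
                col.names = Xnn, fill = TRUE)
dim(dat)
```

For the file discussed here the same call would use sep = ";" and a
vector running from "X01" to "X44".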

Ted.
