
strange behavior when reading csv - line wraps

9 messages · Martin Tomko, jim holtman, (Ted Harding)

#
Dear All,
I am observing a strange behavior and searching the archives and help 
pages didn't help much.
I have a csv with a variable number of fields in each line.

I use
dataPoints <- read.csv(inputFile, header = FALSE, sep = ";", fill = TRUE)

to read it in, and it works. But some lines are long and 'wrap': they 
split and continue on the next row. So when I check the dim of the 
frame, the dimensions are not correct, and I can see from a printout 
that such lines are split in two in the frame. I checked the input file 
and all is good.

an example of the input is:
37;2175168475;13;8.522729;47.19537;16366682 at N00;30;sculpture;bird;tourism;animal;statue;canon;eos;rebel;schweiz;switzerland;eagle;swiss;adler;skulptur;zug;1750;28;tamron;f28;canton;tourismus;vogel;baar;kanton;xti;tamron1750;1750mm;tamron1750mm;400d;rabbitriotnet;

where the last values occurs on the next line in the data frame.

It does not have to be the last value; in the following example, the 
word "kempten" starts the next line:
39;167757703;12;10.309295;47.724545;21903142 at N00;36;white;building;tower;clock;clouds;germany;bayern;deutschland;bavaria;europa;europe;eagle;adler;eu;wolke;dome;townhall;rathaus;turm;weiss;allemagne;europeanunion;bundesrepublik;gebaeude;glocke;brd;allgau;kuppel;europ;kempten;niemcy;europo;federalrepublic;europaischeunion;europaeischeunion;germanio;

What could be the reason?

I was thinking about solving the issue by using a different separator 
for the first 7 fields and concatenating all of the remaining values 
into a single string value, but I could not figure out how to do such a 
substitution in R. Unfortunately, on my system I cannot specify a range 
for sed...
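Such a substitution can in fact be done within R. A minimal sketch, using repeated sub() calls to replace the first 7 occurrences of ";" (the sample line and the "|" replacement separator are illustrative):

```r
x <- "37;217;13;8.52;47.19;id1;30;sculpture;bird;tourism;"

## sub() replaces only the first match, so looping 7 times converts
## exactly the first 7 ";" separators into "|"
for (i in 1:7) x <- sub(";", "|", x, fixed = TRUE)
```

The file could then be read with sep = "|", leaving the whole tag list as a single trailing field.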

Thanks for any help/pointers
Martin
#
Jim,
the two lines I put in are the actual problematic input lines.
In these examples there are no quotes nor # signs, although I have no 
means to make sure they do not occur in the inputs (any hints how I 
could deal with that?).
I am trying to avoid as much pre-processing outside R as possible, and I 
have to process about 500 files with up to 3000 records each, so I need 
a more or less automated/batch solution - so any string substitution 
will have to occur in R. But for the moment I do not see a reason for 
substitution, and the wrapping still occurs.
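One way to guard against stray quote or # characters is to disable them in the read call itself; a sketch with two made-up lines (read.csv() already defaults to comment.char = "", but setting both explicitly makes the intent clear):

```r
inputFile <- tempfile()
writeLines(c("1;o\"brien;x;", "2;abc#def;y;"), inputFile)

## quote = "" and comment.char = "" make read.csv() treat " and # as
## ordinary characters rather than as quoting/comment markers
d <- read.csv(inputFile, header = FALSE, sep = ";", fill = TRUE,
              quote = "", comment.char = "", stringsAsFactors = FALSE)
```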

Cheers
Martin
jim holtman wrote:
#
In a private correspondence with Martin Tomko, I think the reason
for the problem has been found.

The numbers of ";"-separated fields in the 82 successive lines of
his file are as follows:

  01:26   02:26   03:33   04:33   05:12   06:12   07:12   08:12,
  09:19   10:19   11:17   12:17   13:23   14:23   15:23   16:23,
  17:23   18:23   19:23   20:23   21:23   22:23   23:23   24:23,
  25:23   26:23   27:23   28:23   29:23   30:23   31:23   32:23,
  33:23   34:23   35:23   36:23   37:23   38:23   39:23   40:23,
  41:23   42:23   43:23   44:23   45:23   46:23   47:23   48:23,
  49:23   50:23   51:23   52:23   53:23   54:23   55:23   56:23,
  57:23   58:23   59:23   60:23   61:34   62:34   63:34   64:34,
  65:13   66:13   67:38   68:38   69:20   70:20   71:44   72:20,
  73:19   74:19   75:20   76:44   77:20   78:19   79:19   80:20,
  81:25   82:25

So in the first 5 lines there is a maximum of 33 fields. Hence, since
there is no header line, read.csv() decides to allocate 33 columns.
(See ?read.csv).

There are the following distinct numbers of fields in the lines:

  12 13 17 19 20 23 25 26 33 34 38 44

so there are lines with 34, 38 and 44 fields. All lines in the CSV
file end with ";", hence there is an implicit blank field at the
end of each line. The lines with 34 fields have the 34th field blank,
so after the break there is presumably a "quasi blank input line"
where the 34th (blank) field has spilled over. Such input will be
ignored with the default "blank.lines.skip = TRUE" option to read.csv().
The longer lines (2 with 38 fields, 2 with 44) will be split after
the 33rd field, the remainder being taken as an additional input
line. As a result, there are 86 (= 82+4) rows in the resulting
dataframe.

This explanation is compatible with what Martin has observed.
The underlying forensic details were sniffed out with a couple
of passes through 'awk' scripts.

One solution is to call read.csv() with option "col.names=Xnn"
where Xnn is a constructed character vector with elements such
as "X01" "X02" ... "X44" (once one has determined, as above, that
there is a maximum of 44 fields per line in the file).
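Ted's suggestion can be sketched as follows, here with two stand-in lines of differing field counts (in the real file maxFields would come out as 44):

```r
## Two sample lines with differing numbers of ;-separated fields
inputFile <- tempfile()
writeLines(c("1;a;b;", "2;a;b;c;d;"), inputFile)

## count.fields() scans the whole file, so the maximum is reliable
maxFields <- max(count.fields(inputFile, sep = ";"))

## A col.names vector longer than the first five lines suggest forces
## read.csv() to allocate that many columns for every row (see ?read.table)
dataPoints <- read.csv(inputFile, header = FALSE, sep = ";", fill = TRUE,
                       col.names = sprintf("X%02d", seq_len(maxFields)))
```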

Ted.
On 30-May-09 19:43:47, jim holtman wrote:
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 30-May-09                                       Time: 21:15:13
------------------------------ XFMail ------------------------------
#
Dear Jim,
with the help of Ted, we diagnosed that the cause is the extreme 
variability in line length: the number of table columns is apparently 
determined from the first five lines, and whatever exceeds that width 
automatically spills onto the next row.
I am now trying to find a way to read in the data despite this. I have 
no control over the table extent; the only thing that would make sense 
for my data is to read in a fixed number of columns and merge all 
remaining columns into a long string in the last one. No idea how to do 
this, though.
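A sketch of that idea, assuming the first 7 fields are fixed; the sample lines are abbreviated stand-ins, and in the real case they would come from readLines(inputFile):

```r
lines <- c("37;217;13;8.52;47.19;id1;30;sculpture;bird;tourism;",
           "39;167;12;10.30;47.72;id2;36;white;building;")

## Split each line on ";" (strsplit drops the trailing empty field),
## keep the first 7 fields, and re-join the rest into one string
parts <- strsplit(lines, ";", fixed = TRUE)
merged <- t(sapply(parts, function(p)
  c(p[1:7], paste(p[-(1:7)], collapse = ";"))))
df <- as.data.frame(merged, stringsAsFactors = FALSE)
```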

Thanks
Martin
jim holtman wrote:

#
Ah!!! It was count.fields() which we had overlooked! We discovered
a work-around which involved using

  Data0 <- readLines(file)

to create a vector of strings, one for each line of the input file,
and then using

  NF <- unlist(lapply(Data0, function(x)
        length(unlist(gregexpr(";", x, fixed=TRUE, useBytes=TRUE)))))

to count the number of occurrences of ";" (the separator) in each line.
(NF+1) produces the same result as count.fields(file,sep=";"). 

Thanks for pointing out the existence of count.fields()!
Ted.
On 31-May-09 15:04:23, jim holtman wrote:
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 31-May-09                                       Time: 16:24:27
------------------------------ XFMail ------------------------------
#
Big thanks to Ted and Jim for all the help.
Martin
(Ted Harding) wrote: