When I have a csv file that is more than 6 lines long, not including
the header, and one of the fields is blank for the last few lines, and
there is an extra comma on one of the lines with the blank field,
read.csv() creates an extra line.
I attached an example file; I'll also paste the contents here:
A,apple
A,orange
A,orange
A,orange
A,orange
A,,,
A,,
-----
wc -l reports that this file has 7 lines
R> system("wc -l test.csv")
7 test.csv
But, read.csv reads 8.
R> read.csv("test.csv", header=FALSE, stringsAsFactors=FALSE)
V1 V2
1 A apple
2 A orange
3 A orange
4 A orange
5 A orange
6 A
7
8 A
If I increase the number of commas at the end of the line, it
increases the number of rows.
This R command to read a 7 line csv:
read.csv(header=FALSE, text="A,apple
A,orange
A,orange
A,orange
A,orange
A,,,,,
A,,")
will produce this:
V1 V2
1 A apple
2 A orange
3 A orange
4 A orange
5 A orange
6 A
7
8
9 A
But if the file has fewer than 7 lines, it doesn't increase the number of rows.
This R command to read a 6 line csv:
read.csv(header=FALSE, text="A,apple
A,orange
A,orange
A,orange
A,,,,,
A,,")
will produce this:
V1 V2 V3 V4 V5 V6
1 A apple NA NA NA NA
2 A orange NA NA NA NA
3 A orange NA NA NA NA
4 A orange NA NA NA NA
5 A NA NA NA NA
6 A NA NA NA NA
Is this intended behavior?
Thanks,
Garrett See
R> version
_
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 2
minor 15.2
year 2012
month 10
day 26
svn rev 61015
language R
version.string R version 2.15.2 (2012-10-26)
nickname Trick or Treat
R> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
read.csv reads more rows than indicated by wc -l
2 messages · G See, Ben Bolker
G See <gsee000 <at> gmail.com> writes:
[snip]
I don't know if it's exactly *intended* or not, but I think it's
more or less as [IMPLICITLY] documented. From ?read.table,
The number of data columns is determined by looking at the first
five lines of input (or the whole file if it has less than five
lines), or from the length of 'col.names' if it is specified and
is longer. This could conceivably be wrong if 'fill' or
'blank.lines.skip' are true, so specify 'col.names' if necessary
(as in the 'Examples').
txt <- "A,apple
A,orange
A,orange
A,orange
A,orange
A,,,,,
A,,"
read.csv(header=FALSE, text=txt )
What is happening here is that
(1) read.table is determining from the first five lines that
there are two columns;
(2) when it gets to line six, it reads each set of two fields as a
separate row.
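A quick way to see the mismatch, as a sketch only: count.fields() reports the number of fields on every line, so comparing the first five lines with the rest shows which lines are wider than the column count read.table settles on. The variable names (n_fields, guessed, too_wide) are just illustrative and not part of any API.

```r
txt <- "A,apple
A,orange
A,orange
A,orange
A,orange
A,,,,,
A,,"

# Count the comma-separated fields on every line of the input.
con <- textConnection(txt)
n_fields <- count.fields(con, sep = ",")
close(con)

# Column count suggested by the first five lines only.
guessed <- max(head(n_fields, 5))

# Lines that have more fields than that guess.
too_wide <- which(n_fields > guessed)

print(n_fields)
print(guessed)
print(too_wide)
```

Any line listed in too_wide is one whose extra fields will not fit the column count taken from the top of the file.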
If you try
read.csv(header=FALSE, text=txt, fill=FALSE,blank.lines.skip=FALSE)
you at least get an error.
But it gets worse:
txt2 <- "A,apple
A,orange
A,orange
A,orange
A,orange
A,b,c,d,e,f
A,g"
read.csv(header=FALSE, text=txt2, fill=FALSE,blank.lines.skip=FALSE)
produces bad results even though fill=FALSE and blank.lines.skip=FALSE ...
Even specifying col.names explicitly doesn't help:
read.csv(header=FALSE, text=txt2, col.names=paste0("V",1:2))
At least count.fields() does detect a problem ...
count.fields(textConnection(txt2),sep=",")
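One commonly used workaround, sketched here under the assumption that the extra fields really are meant to be extra columns: size col.names from the widest line reported by count.fields(), so the column count no longer depends on the first five lines. Unlike the col.names=paste0("V",1:2) call above, this passes as many names as the widest line has fields. The helper name read_ragged_csv is hypothetical.

```r
txt2 <- "A,apple
A,orange
A,orange
A,orange
A,orange
A,b,c,d,e,f
A,g"

# Hypothetical helper: read comma-separated text whose lines vary in width.
read_ragged_csv <- function(text, sep = ",") {
  con <- textConnection(text)
  on.exit(close(con))
  # Widest line in the whole input, not just the first five lines.
  n_max <- max(count.fields(con, sep = sep), na.rm = TRUE)
  # Supplying n_max column names makes read.table allocate that many
  # columns; fill = TRUE pads the shorter lines.
  read.csv(text = text, header = FALSE, sep = sep, fill = TRUE,
           stringsAsFactors = FALSE,
           col.names = paste0("V", seq_len(n_max)))
}

df <- read_ragged_csv(txt2)
print(dim(df))
```

This reads the input twice, which is fine for small files but worth keeping in mind for large ones. If the wide line is actually a data error, it seems better to stop on the count.fields() result than to pad.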
Somewhere on my wish/TO DO list is for someone to rewrite read.table for
better robustness *and* efficiency ...