(1) read.table(), with sep="\t", identifies 13 out of 1400 records, in a file with 1400 records of 3 fields each, as having only 2 fields. This happens under version 2.3.1 for Windows as well as with R 2.3.1 for Mac OS X, and with R-devel under Mac OS X. [R version 2.4.0 Under development (unstable) (2006-07-03 r38478)]

(2) Using read.table() with sep="\t", only the first 1569 records of an 1821-record file are input. The file has exactly two fields in each record, and the minimum length of the second field is 1 character. If, however, I extract lines 1561 to 1650 from the file (the file "short.txt" below), all 90 lines are input.

> webtwo <- "http://www.maths.anu.edu.au/~johnm/testfiles/twotabs.txt"
> xy <- read.table(url(webtwo), sep="\t")
Warning message:
number of items read is not a multiple of the number of columns
> z <- count.fields(url(webtwo), sep="\t")
> table(z)
z
   2    3
  13 1387
> table(sapply(strsplit(readLines(url(webtwo)), split="\t"), length))

   3
1400
> readLines(url(webtwo))[z==2][9:13]  # last 5 as a sample (shorter lines)
[1] "865\tlinear model (lm)! Cook's distance\t152"
[2] "1019\tlinear model (lm)! Cook's distance\t177"
[3] "1048\tlinear model (lm)! Cook's distance\t183"
[4] "1082\tlinear model (lm)! Cook's distance\t187"
[5] "1220\tlinear model (lm)! Cook's distance\t214"

> weblong <- "http://www.maths.anu.edu.au/~johnm/testfiles/long.txt"
> webshort <- "http://www.maths.anu.edu.au/~johnm/testfiles/short.txt"
> xyLong <- read.table(url(weblong), sep="\t")
> dim(xyLong)  # Should be 1821 x 2
[1] 1569    2
> xyShort <- read.table(url(webshort), sep="\t")
> dim(xyShort)  # Should be, and will be, 90 x 2
[1] 90  2
> long <- readLines(url(weblong))
> short <- readLines(url(webshort))
> length(long)
[1] 1821
> length(short)
[1] 90
> all(long[1561:1650]==short)  # short is lines 1561:1650 of long
[1] TRUE
> ## Moreover strsplit() can pick up the \t's correctly
> lsplit <- strsplit(long, "\t")
> table(sapply(lsplit, length))

   2
1821
> # Try also table(sapply(lsplit, function(x)x[2]))

--please do not edit the information below--

Version:
 platform = powerpc-apple-darwin8.6.0
 arch = powerpc
 os = darwin8.6.0
 system = powerpc, darwin8.6.0
 status =
 major = 2
 minor = 3.1
 year = 2006
 month = 06
 day = 01
 svn rev = 38247
 language = R
 version.string = Version 2.3.1 (2006-06-01)

Locale:
C

Search Path:
 .GlobalEnv, package:lattice, package:methods, package:stats, package:graphics, package:grDevices, package:utils, package:datasets, Autoloads, package:base
read.table() errors with tab as separator (PR#9061)
6 messages · John Maindonald, Peter Dalgaard, Brian Ripley
John.Maindonald at anu.edu.au writes:
(1) read.table(), with sep="\t", identifies 13 out of 1400 records, in a file with 1400 records of 3 fields each, as having only 2 fields. This happens under version 2.3.1 for Windows as well as with R 2.3.1 for Mac OS X, and with R-devel under Mac OS X. [R version 2.4.0 Under development (unstable) (2006-07-03 r38478)] (2) Using read.table() with sep="\t", only the first 1569 records of an 1821-record file are input. The file has exactly two fields in each record, and the minimum length of the second field is 1 character. If, however, I extract lines 1561 to 1650 from the file (the file "short.txt" below), all 90 lines are input.
Notice that the single quote is a quote character in read.table (as opposed to read.delim, which uses only the double quote, to cater for TAB-separated files from Excel & friends).
[1] "865\tlinear model (lm)! Cook's distance\t152"
                                 ^
                               !!!!
(This reminds me that we probably should shift the default for
comment.char too since it leads to similar issues, but it seems not to
be the problem in this case.)
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907
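The quoting behaviour described above can be reproduced on a small local file. This is only a sketch: the three records are made up for illustration (modelled on the lines shown in the report), and the variable names are not from the original post. It compares field counts with the default quote set against counts with quoting disabled, then reads the file with quote="" and with read.delim().

```r
# Hypothetical three-record, tab-separated file; two records contain an
# apostrophe, as in the "Cook's distance" lines from the report.
recs <- c("865\tlinear model (lm)! Cook's distance\t152",
          "866\tlinear model (lm)! residuals\t153",
          "1019\tlinear model (lm)! Cook's distance\t177")
tf <- tempfile(fileext = ".txt")
writeLines(recs, tf)

# Default quote set includes the single quote, so the apostrophes are
# treated as quote characters and the counts need not all be 3.
n_default <- count.fields(tf, sep = "\t")

# With quoting disabled, each line is split on tabs only.
n_noquote <- count.fields(tf, sep = "\t", quote = "")

# Two ways to read the file without the apostrophes acting as quotes:
xy <- read.table(tf, sep = "\t", quote = "")   # quoting switched off
xd <- read.delim(tf, header = FALSE)           # only " is a quote character

print(n_default)
print(n_noquote)
print(dim(xy))
print(dim(xd))
unlink(tf)
```

For files like the ones in the report, passing quote="" to read.table() (or using read.delim()) is probably the simplest workaround, provided the data contain no genuinely quoted fields.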
On Wed, 5 Jul 2006, Peter Dalgaard wrote:
(This reminds me that we probably should shift the default for
comment.char too since it leads to similar issues, but it seems not to
be the problem in this case.)
This seems to imply that we should change the default for 'quote': to do so could break a lot of scripts. (Given how long the default has been comment.char="#", I doubt if we should change that either.)
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
Prof Brian Ripley <ripley at stats.ox.ac.uk> writes:
This seems to imply that we should change the default for 'quote': to do so could break a lot of scripts. (Given how long the default has been comment.char="#", I doubt if we should change that either.)
Sorry, unclear. We already change quote= for read.delim and read.csv, and I was suggesting also to modify the default for comment.char for those functions, but definitely not for read.table. Arguably, those functions are there to handle file formats generated by other programs, and it is unlikely that such programs will generate comment lines starting with #, whereas we have learned that Excel will occasionally generate fields like #NULL#, which mess up the parsing.
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907
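The comment.char issue mentioned above can be illustrated in the same way. Again a sketch under stated assumptions: the file contents are invented, with a #NULL# field standing in for the kind of value Excel is said to produce, and the variable names are hypothetical. With comment.char="#" everything from the # to the end of the line is discarded; with comment.char="" the field survives.

```r
# Hypothetical tab-separated export with a header and a #NULL# field.
recs <- c("id\tvalue\tnote",
          "1\t#NULL#\tfirst",
          "2\t3.5\tsecond")
tf <- tempfile(fileext = ".txt")
writeLines(recs, tf)

# comment.char defaults to "#" here, so the second line is cut short.
n_hash <- count.fields(tf, sep = "\t", quote = "")

# With comment handling switched off, every line is split on tabs only.
n_plain <- count.fields(tf, sep = "\t", quote = "", comment.char = "")

# Reading with both quoting and comments disabled keeps the field intact.
xy <- read.table(tf, sep = "\t", header = TRUE, quote = "",
                 comment.char = "", stringsAsFactors = FALSE)

print(n_hash)
print(n_plain)
print(xy)
unlink(tf)
```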
On Wed, 5 Jul 2006, Peter Dalgaard wrote:
Sorry, unclear. We already change quote= for read.delim and read.csv, and I was suggesting also to modify the default for comment.char for those functions, but definitely not for read.table. Arguably, those functions are there to handle file formats generated by other programs, and it is unlikely that such programs will generate comment lines starting with #, whereas we have learned that Excel will occasionally generate fields like #NULL#, which mess up the parsing.
Ah, that does seem a sensible defensive move.
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
Prof Brian Ripley <ripley at stats.ox.ac.uk> writes:
Ah, that does seem a sensible defensive move.
Committed for r-devel (only). Make check seems happy, but we may need to watch out for the package checking.
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907
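For readers on a current R release, the defaults discussed in this thread can be inspected directly rather than taken on trust. The snippet below is a small sketch, assuming the utils package is attached as usual; the name `defaults` is just for illustration. It tabulates the quote and comment.char defaults of read.table(), read.csv() and read.delim().

```r
# Collect the default 'quote' and 'comment.char' arguments of each reader
# into a character matrix (rows: argument, columns: function).
defaults <- sapply(
  list(read.table = read.table, read.csv = read.csv, read.delim = read.delim),
  function(f) unlist(formals(f)[c("quote", "comment.char")])
)
print(defaults)
```

In releases that include the change committed above, the comment.char entries for read.csv() and read.delim() should be empty strings, while read.table() keeps "#".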