count.fields inconsistent with read.table?

3 messages · Peter Dalgaard, Sam Steingold

#
Hi,

batch is a character vector of lines returned by readLines() from a
newline-terminated file. Here is the relevant section:
=========================================================
AA	BB	CC	DD			EE	FF
GG	H

H	JJ	KK			LL	MM
=========================================================
As you can see, one line is corrupt: two CRLFs were inserted into it.
That is okay because I drop the bad lines, or at least I hope I do:

  conn <- textConnection(batch)
  field.counts <- count.fields(conn, sep="\t", comment.char="", quote="")
  close(conn)
  good <- field.counts == 8  # this should drop all bad lines
  if (!all(good))
    batch <- batch[good]
  conn <- textConnection(batch)
  ret <- read.table(conn, sep="\t", comment.char="", quote="")
  close(conn)

I get this error in read.table():

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 7151 did not have 8 elements

how come?!

Also, is there some error recovery?
E.g., the code above is part of a function. Is there a way to recover
batch (without re-running the whole thing)?

Thanks!
#
On Feb 24, 2012, at 06:58, Sam Steingold wrote:

Actually, I don't see... (It's pretty hard to count TAB characters by eye.)

You can do better than this in terms of providing clues for us. "batch" is a character vector, right? So recheck that count.fields returns all 8's after removal of the bad lines. Also check that the dimensions match: is length(batch) actually the same as length(field.counts)? Finally, what is in line 7151?

As for error recovery: well, you can try().
#
how about this?
I replaced TAB with ^I and CR with ^M.
is this better?

here I use <TAB> and <CR> instead:
So, you see, there are two data lines here:
A..F: good, with 8 fields.
G..M: BAD, two CRLFs were inserted inside the 2nd field, turning one line
into 3 lines.
So I must drop 3 lines from the input.
batch <- lines[807000:808000]
 conn <- textConnection(batch)
 field.counts <- count.fields(conn, sep="\t", comment.char="", quote="")
 close(conn)
 good <- field.counts == length(col.names)
 which(!good)
[1] 152 153

## WRONG: it should be 3 lines, 154 is also bad - see above

 batch[!good]
[1] "GG\tH" ""                     
 length(batch)
[1] 1001
 length(good)
[1] 1000

## WRONG: batch, field.counts and good should have the same length
 
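AHA! blank.lines.skip !!!
count.fields() defaults to blank.lines.skip=TRUE, so it silently skips the
empty line and field.counts ends up one element shorter than batch.
I must set it to FALSE!!!
And that does fix the problem...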
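
For the archive, here is a minimal sketch of the mismatch and the fix on toy data. The four-line batch and the helper count_fields_per_line() are made up for illustration (they are not my real file or code), and I assume 3 fields per good line instead of 8:

```r
# Toy data (hypothetical): good lines have 3 TAB-separated fields.
# Line 2 is truncated and line 3 is empty, like a record split by stray CRLFs.
batch <- c("a\tb\tc", "d\te", "", "f\tg\th")

# Hypothetical helper: count fields per line, with blank-line skipping on or off.
count_fields_per_line <- function(lines, skip_blank) {
  conn <- textConnection(lines)
  on.exit(close(conn))
  count.fields(conn, sep = "\t", quote = "", comment.char = "",
               blank.lines.skip = skip_blank)
}

# Default behaviour: the empty line is skipped, so the result is one
# element shorter than batch and cannot be used to index it safely.
skipped <- count_fields_per_line(batch, skip_blank = TRUE)

# With blank.lines.skip = FALSE there is one count per input line.
aligned <- count_fields_per_line(batch, skip_blank = FALSE)
stopifnot(length(aligned) == length(batch))

# Now the logical index lines up with batch and the bad lines really go away.
clean <- batch[aligned == 3]
conn <- textConnection(clean)
ret <- read.table(conn, sep = "\t", quote = "", comment.char = "")
close(conn)
```

The stopifnot() is the length check Peter suggested; it would have flagged the problem before read.table() ever ran.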
that's the first line with a <CR>:

GG<TAB>H<CR>
It appears that try() gives me access to the error message but not to the
erroneous data, i.e., I still have to reload the file to get the batch
character vector.
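
One possible workaround, as a rough sketch only: have the function catch the error itself with tryCatch() and hand back the input next to the message, so nothing has to be reloaded. The function name parse_batch() and the shape of the returned list are my own invention here, not anything built into R:

```r
# Hypothetical wrapper: returns a data frame on success, or a list holding
# the error message and the original lines on failure.
parse_batch <- function(batch) {
  conn <- textConnection(batch)
  on.exit(close(conn))
  tryCatch(
    read.table(conn, sep = "\t", quote = "", comment.char = ""),
    error = function(e) list(error = conditionMessage(e), batch = batch)
  )
}

# Toy input (made up): the second line has too few fields,
# so read.table() fails inside scan().
bad <- c("a\tb\tc", "d\te", "f\tg\th")
res <- parse_batch(bad)

if (!is.data.frame(res)) {
  message("parse failed: ", res$error)
  recovered <- res$batch  # the lines are still available for inspection
}
```

The handler is a closure, so it can still see batch when the error fires; that is the only trick here.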