I have a humongous csv file containing census data, far too big to read into
RAM. I have been trying to extract individual columns from this file using
the colbycol package. This works for certain subsets of the columns, but not
for others. I have not yet been able to precisely identify the problem
columns, as there are 731 columns and running colbycol on the file on my old
slow machine takes about 6 hours.
However, my suspicion is that there are some funky characters, either
control characters or characters with some non-standard encoding, somewhere
in this 14 gig file. Moreover, I am concerned that these characters may
cause me trouble down the road even if I use a different approach to getting
columns out of the file.
Is there an R utility that will search through my file, without reading it
all into memory at once, and find non-standard characters or misplaced
(non-end-of-line) control characters? Or some R code to the same end? Even
if the real problem ultimately proves to be different, it would be helpful
to eliminate this possibility. And this is also something I would routinely
run on files from external sources if I had it.
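In case it helps to illustrate what I have in mind, here is a rough sketch of
one way to do this in base R: read the file through a connection in chunks of
lines, and flag any line containing a byte outside printable ASCII (plus tab).
The function name, the chunk size, and the "allowed characters" class are all
my own choices, not from any package, so adjust them as needed.

```r
# Hypothetical sketch: scan a huge text file chunk by chunk for lines
# containing non-printable or non-ASCII bytes, without loading the
# whole file into memory. Printable ASCII is \x20-\x7e; tab is also
# allowed here. Everything else (control chars, high bytes) is flagged.
find_bad_chars <- function(path, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  bad <- data.frame(line = integer(0), text = character(0),
                    stringsAsFactors = FALSE)
  line_no <- 0L
  repeat {
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0L) break
    # useBytes = TRUE makes grepl compare raw bytes, avoiding
    # encoding-related errors on the very characters we are hunting
    hits <- grepl("[^\x20-\x7e\t]", lines, useBytes = TRUE)
    if (any(hits)) {
      bad <- rbind(bad, data.frame(line = line_no + which(hits),
                                   text = lines[hits],
                                   stringsAsFactors = FALSE))
    }
    line_no <- line_no + length(lines)
  }
  bad
}
```

Running `find_bad_chars("census.csv")` would then return the line numbers and
contents of the suspect lines, so you could inspect which columns they fall
in. One caveat: readLines itself may stumble on embedded NUL bytes, so a
byte-level scan with readBin would be more robust if NULs turn out to be the
culprit.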
I am working in a Windows XP environment, in case that makes a difference.
Any help anyone could offer would be greatly appreciated.
Sincerely, andrewH