
troubles reading a text file

Hi Igor,

It appears that the encoding is UTF-16.
 [1] "??" ""      ""      ""      ""      ""      ""      ""      ""      ""      ""      ""      ""
[14] ""      ""      ""      ""      ""      ""      ""

A search for "??" leads to the Wikipedia page
http://en.wikipedia.org/wiki/Byte_order_mark, specifically the UTF-16
section.
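
For anyone who wants to confirm a byte order mark directly instead of searching for the odd characters, here is a minimal sketch that looks at the first two bytes of a file. The demo file and its contents are made up for illustration (tab-separated, UTF-16LE); it is not the data file discussed in this thread.

```r
# Hypothetical demo file: a tiny tab-separated table written as
# UTF-16LE with a leading byte order mark (FF FE).
tmp <- tempfile(fileext = ".txt")
txt <- "YYYYMM\tA\tB\n176512\t-32.61\t-32.92\n"
# For plain ASCII text, UTF-16LE is each byte followed by a zero byte.
writeBin(as.raw(c(0xff, 0xfe, rbind(utf8ToInt(txt), 0L))), tmp)

# Read the first two raw bytes and compare them with the two UTF-16 BOMs.
bom <- readBin(tmp, what = "raw", n = 2)
guess <- if (identical(bom, as.raw(c(0xff, 0xfe)))) {
  "UTF-16LE"
} else if (identical(bom, as.raw(c(0xfe, 0xff)))) {
  "UTF-16BE"
} else {
  "unknown"
}
guess
```

The timing and data frame printed next come from the read.table run on the file, not from this sketch.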
user  system elapsed
 28.556   0.112  28.712
[1] 18001
  YYYYMM X79.75N.49.75W X79.75N.49.25W X79.75N.48.75W X79.75N.48.25W X79.75N.47.75W X79.75N.47.25W
1 176512         -32.61         -32.92         -33.34         -33.65         -34.09         -34.21
2 176601         -31.89         -31.96         -32.26         -32.48         -32.71         -33.03
  X79.75N.46.75W X79.75N.46.25W X79.75N.45.75W
1         -34.65         -34.98         -35.43
2         -33.29         -33.41         -33.76

Here you can see that I downloaded just the first 1 MB of the file,
so it has only two lines after the header, yet it took 28 seconds to
read. I'm not sure how long read.table would take on the whole
~600 MB file.
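
As a rough illustration of reading a UTF-16 file with read.table, here is a sketch that passes the encoding per call through the fileEncoding argument and declares the column types up front with colClasses, which the read.table documentation suggests can help with speed. The demo file, the tab separator, and the all-numeric columns are assumptions for the example; I have not timed this on the real file.

```r
# Hypothetical demo file: three tab-separated columns, UTF-16LE with BOM.
tmp <- tempfile(fileext = ".txt")
txt <- paste0("YYYYMM\t79.75N/49.75W\t79.75N/49.25W\n",
              "176512\t-32.61\t-32.92\n",
              "176601\t-31.89\t-31.96\n")
writeBin(as.raw(c(0xff, 0xfe, rbind(utf8ToInt(txt), 0L))), tmp)

# fileEncoding applies to this call only, so no global option is changed.
# colClasses = "numeric" is recycled across all columns (an assumption
# that every column holds numbers).
dat <- read.table(tmp, header = TRUE, sep = "\t",
                  fileEncoding = "UTF-16", colClasses = "numeric")
dim(dat)
names(dat)   # check.names = TRUE rewrites the headers into valid R names
```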

scan() might be faster:
(and this does not require setting options(encoding="UTF-16"))
Read 36002 items
   user  system elapsed
  0.104   0.000   0.104
Read 18001 items
     YYYYMM 79.75N/49.75W 79.75N/49.25W 79.75N/48.75W 79.75N/48.25W 79.75N/47.75W 79.75N/47.25W
[1,] 176512        -32.61        -32.92        -33.34        -33.65        -34.09        -34.21
[2,] 176601        -31.89        -31.96        -32.26        -32.48        -32.71        -33.03
     79.75N/46.75W 79.75N/46.25W 79.75N/45.75W
[1,]        -34.65        -34.98        -35.43
[2,]        -33.29        -33.41        -33.76

(note the different colnames, similar to using check.names=FALSE in
read.table, and the result is a matrix, not a data frame as returned
by read.table)
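
For reference, one way a scan()-based read like this could be put together: read the header line as character, read the remaining values as numbers, and fold them into a matrix by row. This is a sketch on a made-up demo file, assuming tab-separated fields and purely numeric data. The variable names are only illustrative and this is not necessarily the exact code that produced the output above.

```r
# Hypothetical demo file: three tab-separated columns, UTF-16LE with BOM.
tmp <- tempfile(fileext = ".txt")
txt <- paste0("YYYYMM\t79.75N/49.75W\t79.75N/49.25W\n",
              "176512\t-32.61\t-32.92\n",
              "176601\t-31.89\t-31.96\n")
writeBin(as.raw(c(0xff, 0xfe, rbind(utf8ToInt(txt), 0L))), tmp)

# Header: first line only, kept as character so the names stay unchanged.
hdr <- scan(tmp, what = "", sep = "\t", nlines = 1,
            fileEncoding = "UTF-16", quiet = TRUE)
# Data: everything after the header, read as one long numeric vector.
val <- scan(tmp, what = 0, sep = "\t", skip = 1,
            fileEncoding = "UTF-16", quiet = TRUE)
# The values arrive row by row, so fill the matrix by row.
m <- matrix(val, ncol = length(hdr), byrow = TRUE,
            dimnames = list(NULL, hdr))
m
```

If a data frame is needed afterwards, as.data.frame(m) converts the matrix, and check.names = FALSE has to be passed to keep the column names as they are.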

HTH,
Jeff
On Sun, Dec 16, 2012 at 6:23 AM, <Igor.Drobyshev2 at uqat.ca> wrote: