troubles reading a text file

4 messages · Igor.Drobyshev2 at uqat.ca, Jeffrey Dick, David Winsemius

#
Hi Igor,

It appears that the encoding is UTF-16.
[1] "??" ""      ""      ""      ""      ""      ""      ""      ""
   ""      ""      ""      ""
[14] ""      ""      ""      ""      ""      ""      ""

A search for "??" leads to the Wikipedia page
http://en.wikipedia.org/wiki/Byte_order_mark, specifically the UTF-16
section.
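
One way to check for a byte order mark directly is to look at the first two bytes of the file. This is only a sketch: it builds a tiny stand-in file rather than using the real data, and it looks at two bytes only, so it cannot tell UTF-16LE from UTF-32LE (which starts with the same two bytes).

```r
# Build a tiny UTF-16LE file with a byte order mark (a stand-in for the real data).
path <- tempfile(fileext = ".txt")
con <- file(path, open = "wb")
writeBin(as.raw(c(0xff, 0xfe)), con)  # UTF-16LE BOM
writeBin(iconv("YYYYMM\t79.75N/49.75W\n", from = "UTF-8", to = "UTF-16LE",
               toRaw = TRUE)[[1]], con)
close(con)

# Read the first two bytes and map them to an encoding guess.
bom <- readBin(path, what = "raw", n = 2)
guess <- if (identical(bom, as.raw(c(0xff, 0xfe)))) {
  "UTF-16LE"
} else if (identical(bom, as.raw(c(0xfe, 0xff)))) {
  "UTF-16BE"
} else {
  "unknown"
}
print(bom)
print(guess)
```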
user  system elapsed
 28.556   0.112  28.712
[1] 18001
YYYYMM X79.75N.49.75W X79.75N.49.25W X79.75N.48.75W X79.75N.48.25W
X79.75N.47.75W X79.75N.47.25W
1 176512         -32.61         -32.92         -33.34         -33.65
      -34.09         -34.21
2 176601         -31.89         -31.96         -32.26         -32.48
      -32.71         -33.03
  X79.75N.46.75W X79.75N.46.25W X79.75N.45.75W
1         -34.65         -34.98         -35.43
2         -33.29         -33.41         -33.76

Here you can see that I have downloaded just the first 1 MB of the
file, so it has only two lines after the header, yet it took 28 seconds
to read. I'm not sure how long read.table would take on the whole
~600 MB file.
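
For reference, a read.table call of this kind might look like the sketch below. It is not the exact command from this message: the sample file, the helper write_utf16le() and the skip count of 3 are all made up for illustration (the real file has more metadata lines before the header). Passing fileEncoding avoids changing options(encoding), and giving nrows and colClasses is the usual advice for speeding up read.table on wide files.

```r
# Hypothetical helper: write lines to a temp file as UTF-16LE with a BOM.
write_utf16le <- function(lines) {
  path <- tempfile(fileext = ".txt")
  con <- file(path, open = "wb")
  on.exit(close(con))
  writeBin(as.raw(c(0xff, 0xfe)), con)
  txt <- paste0(lines, "\n", collapse = "")
  writeBin(iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], con)
  path
}

# Small stand-in for the real file: metadata lines, a blank line, header, data.
path <- write_utf16le(c(
  "NAME \"demo\"",
  "NUMBER_OF_COLUMNS\t3",
  "",
  "YYYYMM\t79.75N/49.75W\t79.75N/49.25W",
  "176512\t-32.61\t-32.92",
  "176601\t-31.89\t-31.96"
))

# Skip the metadata, declare the encoding, and give read.table size/type hints.
timing <- system.time(
  dat <- read.table(path, header = TRUE, sep = "\t", skip = 3, nrows = 2,
                    colClasses = "numeric", comment.char = "",
                    fileEncoding = "UTF-16LE")
)
print(timing)
print(dat)
```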

scan() might be faster:
(and this does not require setting options(encoding="UTF-16"))
Read 36002 items
   user  system elapsed
  0.104   0.000   0.104
Read 18001 items
YYYYMM 79.75N/49.75W 79.75N/49.25W 79.75N/48.75W 79.75N/48.25W
79.75N/47.75W 79.75N/47.25W
[1,] 176512        -32.61        -32.92        -33.34        -33.65
    -34.09        -34.21
[2,] 176601        -31.89        -31.96        -32.26        -32.48
    -32.71        -33.03
     79.75N/46.75W 79.75N/46.25W 79.75N/45.75W
[1,]        -34.65        -34.98        -35.43
[2,]        -33.29        -33.41        -33.76

(note the different colnames, similar to using check.names=FALSE in
read.table, and the result is a matrix, not a data frame as returned
by read.table)
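
A sketch of the scan() approach, again on a made-up sample file rather than the real one: read the header line as character, read the remaining values as numbers, and shape them into a matrix. The helper write_utf16le() and the skip counts are assumptions for this toy file, and the sample has no trailing tabs, which the real file may have.

```r
# Hypothetical helper: write lines to a temp file as UTF-16LE with a BOM.
write_utf16le <- function(lines) {
  path <- tempfile(fileext = ".txt")
  con <- file(path, open = "wb")
  on.exit(close(con))
  writeBin(as.raw(c(0xff, 0xfe)), con)
  txt <- paste0(lines, "\n", collapse = "")
  writeBin(iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], con)
  path
}

path <- write_utf16le(c(
  "NAME \"demo\"",
  "NUMBER_OF_COLUMNS\t3",
  "",
  "YYYYMM\t79.75N/49.75W\t79.75N/49.25W",
  "176512\t-32.61\t-32.92",
  "176601\t-31.89\t-31.96"
))

# Header: one line of character fields after the three leading lines.
header <- scan(path, what = "", sep = "\t", skip = 3, nlines = 1,
               quiet = TRUE, fileEncoding = "UTF-16LE")

# Data: everything after the header, read as doubles (scan's default).
values <- scan(path, sep = "\t", skip = 4, quiet = TRUE,
               fileEncoding = "UTF-16LE")

# Fill row by row; column names are kept as they appear in the file.
mat <- matrix(values, ncol = length(header), byrow = TRUE,
              dimnames = list(NULL, header))
print(mat)
```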

HTH,
Jeff
On Sun, Dec 16, 2012 at 6:23 AM, <Igor.Drobyshev2 at uqat.ca> wrote:
#
On Dec 15, 2012, at 2:23 PM, <Igor.Drobyshev2 at uqat.ca> wrote:

            
After inspecting a small fragment (8 MB, downloaded with an ftp client) with both Firefox and TextEdit.app and seeing that both reported it to be UTF-16 encoded, I saved it from TextEdit as UTF-8 and could then view it with R's readLines. These are the first 7 lines and the beginning of the eighth:
[1] "NAME \"Monthly European Temperatures 1766-2000 [T=2m, Celsius]\""
[2] "LONGITUDES\t180\t50.00W\t40.00E\t"
[3] "LATITUDES\t100\t80.00N\t30.00N\t"
[4] "NODATA_STRING\tNA"
[5] "NUMBER_OF_ROWS\t2820"
[6] "NUMBER_OF_COLUMNS\t18001\t"
[7] ""
[8] "YYYYMM\t79.75N/49.75W\t79.75N/49.25W\t79.75N/48.75W\t79.75N/48.25W\t79.75N/47.75W\t79.75N/47.25W\t79.75N/46.75W\t79.75N/46.25W\t79.75N/45.75W\t79.75N/45.25W\t79.75N/44.75W\t79.75N/44.25W\t79.7
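
The same peek at the leading lines can probably be done without converting the file in an editor first, by opening a connection with a declared encoding. This is a sketch on a made-up sample file; the helper write_utf16le() is hypothetical and the real file has more metadata lines.

```r
# Hypothetical helper: write lines to a temp file as UTF-16LE with a BOM.
write_utf16le <- function(lines) {
  path <- tempfile(fileext = ".txt")
  con <- file(path, open = "wb")
  on.exit(close(con))
  writeBin(as.raw(c(0xff, 0xfe)), con)
  txt <- paste0(lines, "\n", collapse = "")
  writeBin(iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], con)
  path
}

path <- write_utf16le(c(
  "NAME \"demo\"",
  "NUMBER_OF_COLUMNS\t3",
  "",
  "YYYYMM\t79.75N/49.75W\t79.75N/49.25W",
  "176512\t-32.61\t-32.92",
  "176601\t-31.89\t-31.96"
))

# Open a text-mode connection that re-encodes from UTF-16LE while reading.
con <- file(path, open = "r", encoding = "UTF-16LE")
first <- readLines(con, n = 4)
close(con)
print(first)

# The header line should contain literal tab characters.
is_tab_separated <- grepl("\t", first[4], fixed = TRUE)
print(is_tab_separated)
```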

As you can readily see, it is a tab-separated file. I was able to get partial success (reading the first three lines, anyway) with:
YYYYMM X79.75N.49.75W X79.75N.49.25W X79.75N.48.75W X79.75N.48.25W X79.75N.47.75W X79.75N.47.25W X79.75N.46.75W X79.75N.46.25W X79.75N.45.75W
1 176512         -32.61         -32.92         -33.34         -33.65         -34.09         -34.21         -34.65         -34.98         -35.43
YYYYMM X79.75N.49.75W X79.75N.49.25W X79.75N.48.75W X79.75N.48.25W X79.75N.47.75W X79.75N.47.25W X79.75N.46.75W X79.75N.46.25W X79.75N.45.75W
1 176512         -32.61         -32.92         -33.34         -33.65         -34.09         -34.21         -34.65         -34.98         -35.43
2 176601         -31.89         -31.96         -32.26         -32.48         -32.71         -33.03         -33.29         -33.41         -33.76
3 176602         -34.31         -34.40         -34.60         -34.79         -35.01         -35.13         -35.46         -35.57         -35.91
That, on the other hand, suggests you have inadequate machine resources for this job. Perhaps you should be thinking of using tools other than R for this project ... or buying more RAM. You should probably have 32 GB for a job this size.
Partially correct but perhaps your problems are multifactorial. 

I was able to get this to "work" from that website:
'data.frame':	3 obs. of  10 variables:
 $ YYYYMM        : int  176512 176601 176602
 $ X79.75N.49.75W: num  -32.6 -31.9 -34.3
 $ X79.75N.49.25W: num  -32.9 -32 -34.4
 $ X79.75N.48.75W: num  -33.3 -32.3 -34.6
 $ X79.75N.48.25W: num  -33.6 -32.5 -34.8
 $ X79.75N.47.75W: num  -34.1 -32.7 -35
 $ X79.75N.47.25W: num  -34.2 -33 -35.1
 $ X79.75N.46.75W: num  -34.6 -33.3 -35.5
 $ X79.75N.46.25W: num  -35 -33.4 -35.6
 $ X79.75N.45.75W: num  -35.4 -33.8 -35.9
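
A partial read like the one above (a few rows, only the leading columns) could be sketched as follows. This is not the command that produced the output above. It uses a made-up five-column sample, a hypothetical helper write_utf16le(), and the "NULL" entries in colClasses to drop columns. It also assumes the header has no trailing tab, since strsplit() would then count one field fewer than read.delim() sees.

```r
# Hypothetical helper: write lines to a temp file as UTF-16LE with a BOM.
write_utf16le <- function(lines) {
  path <- tempfile(fileext = ".txt")
  con <- file(path, open = "wb")
  on.exit(close(con))
  writeBin(as.raw(c(0xff, 0xfe)), con)
  txt <- paste0(lines, "\n", collapse = "")
  writeBin(iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], con)
  path
}

path <- write_utf16le(c(
  "NAME \"demo\"",
  "NUMBER_OF_COLUMNS\t5",
  "",
  "YYYYMM\t79.75N/49.75W\t79.75N/49.25W\t79.75N/48.75W\t79.75N/48.25W",
  "176512\t-32.61\t-32.92\t-33.34\t-33.65",
  "176601\t-31.89\t-31.96\t-32.26\t-32.48",
  "176602\t-34.31\t-34.40\t-34.60\t-34.79"
))

# Count the columns from the header line (line 4 in this toy file).
con <- file(path, open = "r", encoding = "UTF-16LE")
hdr <- readLines(con, n = 4)[4]
close(con)
n_cols <- length(strsplit(hdr, "\t", fixed = TRUE)[[1]])

# Keep the first 3 columns (NA = let R guess the type), drop the rest ("NULL").
keep <- 3
classes <- c(rep(NA, keep), rep("NULL", n_cols - keep))
part <- read.delim(path, skip = 3, nrows = 3, colClasses = classes,
                   fileEncoding = "UTF-16LE")
str(part)
```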
#
On Dec 15, 2012, at 8:45 PM, David Winsemius wrote:

            
I was wrong about that. The object size in a 64-bit R was:

inp      291382512
[1] 2820
[1] 18001
[1] 2820

So it seems to be all there. It's considerably smaller than I guessed.
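
For anyone wanting to run the same kind of check, a minimal sketch on a toy data frame (the name inp is reused here only as a stand-in; this is not the real data):

```r
# Toy stand-in for the imported data: 20 rows by 5 numeric columns.
set.seed(1)
inp <- as.data.frame(matrix(rnorm(20 * 5), nrow = 20, ncol = 5))

# Memory used by the object, in bytes and in a human-readable unit.
size_bytes <- as.numeric(object.size(inp))
print(size_bytes)
print(object.size(inp), units = "auto")

# Row and column counts, to compare against NUMBER_OF_ROWS / NUMBER_OF_COLUMNS.
print(dim(inp))
```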