troubles reading a text file
Hi Igor, It appears that the encoding is UTF-16.
readLines("temp-mon.txt")
[1] "??" "" "" "" "" "" "" "" "" "" "" "" ""
[14] "" "" "" "" "" "" ""
A search for "??" leads to the Wikipedia page http://en.wikipedia.org/wiki/Byte_order_mark, specifically the UTF-16 section.
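A byte order mark can also be checked directly by looking at the first raw bytes of the file, which does not depend on how the console renders them. The following is only a sketch: guess_bom() is a hypothetical helper (not part of base R), and the file it is tried on is a small synthetic stand-in written to a temporary path, not the NOAA file.

```r
# Hypothetical helper (an assumption for illustration): guess the encoding
# from a leading byte order mark, or return NA if none is recognised.
guess_bom <- function(path) {
  b <- readBin(path, what = "raw", n = 4L)
  starts <- function(sig) {
    length(b) >= length(sig) && all(b[seq_along(sig)] == as.raw(sig))
  }
  # Note: a UTF-32LE BOM (FF FE 00 00) would be reported as UTF-16LE here.
  if (starts(c(0xEF, 0xBB, 0xBF))) {
    "UTF-8"
  } else if (starts(c(0xFF, 0xFE))) {
    "UTF-16LE"
  } else if (starts(c(0xFE, 0xFF))) {
    "UTF-16BE"
  } else {
    NA_character_
  }
}

# Synthetic test file: a UTF-16LE BOM followed by UTF-16LE text.
tmp <- tempfile(fileext = ".txt")
body <- iconv("YYYYMM 1 2\n", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
writeBin(c(as.raw(c(0xFF, 0xFE)), body), tmp)

bom <- guess_bom(tmp)
print(bom)
```

On the real file one would call guess_bom("temp-mon.txt") instead.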
options(encoding="UTF-16")
system.time(Temperature<-read.table("temp-mon.txt",skip = 7, header = TRUE, na.strings="NA",sep=""))
   user  system elapsed
 28.556   0.112  28.712
ncol(Temperature)
[1] 18001
Temperature[, 1:10]
YYYYMM X79.75N.49.75W X79.75N.49.25W X79.75N.48.75W X79.75N.48.25W
X79.75N.47.75W X79.75N.47.25W
1 176512 -32.61 -32.92 -33.34 -33.65
-34.09 -34.21
2 176601 -31.89 -31.96 -32.26 -32.48
-32.71 -33.03
X79.75N.46.75W X79.75N.46.25W X79.75N.45.75W
1 -34.65 -34.98 -35.43
2 -33.29 -33.41 -33.76
Here you can see that I have downloaded just the first 1 MB of the
file, so it only has two lines after the header, but 28 seconds to
read it... I'm not sure how long it would take to read.table on the
whole ~600 MB file.
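If a data frame is still wanted, read.table() accepts fileEncoding directly, so the global options(encoding=) setting is not needed, and declaring colClasses spares it from guessing the type of each of the 18001 columns. Whether that makes the full file practical is an assumption I have not tested. The sketch below runs on a small synthetic UTF-16 file with seven made-up header lines, not on the NOAA data.

```r
# Synthetic stand-in for the real file: 7 descriptive lines, a column header,
# and two data rows, written as UTF-16LE with a byte order mark.
lines <- c(paste("header line", 1:7),
           "YYYYMM 79.75N/49.75W 79.75N/49.25W",
           "176512 -32.61 -32.92",
           "176601 -31.89 -31.96")
txt <- paste0(paste(lines, collapse = "\n"), "\n")
tmp <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xFF, 0xFE)),
           iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]),
         tmp)

# fileEncoding is passed per call; colClasses = "numeric" is recycled over
# all columns; check.names = FALSE keeps names such as "79.75N/49.75W".
dat <- read.table(tmp, fileEncoding = "UTF-16", skip = 7, header = TRUE,
                  check.names = FALSE, colClasses = "numeric")
print(dat)
```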
scan() might be faster:
(and this does not require setting options(encoding="UTF-16"))
system.time(Temperature <- scan("temp-mon.txt", fileEncoding="UTF-16", skip=8))
Read 36002 items
   user  system elapsed
  0.104   0.000   0.104
Temperature <- matrix(Temperature, ncol=18001, byrow=TRUE)
Temperature.colnames <- scan("temp-mon.txt", character(), fileEncoding="UTF-16", skip=7, nmax=18001)
Read 18001 items
colnames(Temperature) <- Temperature.colnames
Temperature[, 1:10]
YYYYMM 79.75N/49.75W 79.75N/49.25W 79.75N/48.75W 79.75N/48.25W
79.75N/47.75W 79.75N/47.25W
[1,] 176512 -32.61 -32.92 -33.34 -33.65
-34.09 -34.21
[2,] 176601 -31.89 -31.96 -32.26 -32.48
-32.71 -33.03
79.75N/46.75W 79.75N/46.25W 79.75N/45.75W
[1,] -34.65 -34.98 -35.43
[2,] -33.29 -33.41 -33.76
(note the different colnames, similar to using check.names=FALSE in
read.table, and the result is a matrix, not a data frame as returned
by read.table)
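The two scan() calls above each open and decode the file from the start. A possible variation, sketched here on a small synthetic UTF-16 file rather than the NOAA data, is to open one connection with the encoding set and read the descriptive lines, the column names and the values from it in a single pass. The file contents and variable names are made up for illustration.

```r
# Synthetic stand-in file: 7 descriptive lines, a column header, two data
# rows, written as UTF-16LE with a byte order mark.
lines <- c(paste("header line", 1:7),
           "YYYYMM 79.75N/49.75W 79.75N/49.25W",
           "176512 -32.61 -32.92",
           "176601 -31.89 -31.96")
txt <- paste0(paste(lines, collapse = "\n"), "\n")
tmp <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xFF, 0xFE)),
           iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]),
         tmp)

con <- file(tmp, open = "r", encoding = "UTF-16")
invisible(readLines(con, n = 7))                        # skip descriptive lines
col_names <- scan(con, what = character(), nlines = 1,  # column header line
                  quiet = TRUE)
values <- scan(con, quiet = TRUE)                       # remaining numeric data
close(con)

m <- matrix(values, ncol = length(col_names), byrow = TRUE,
            dimnames = list(NULL, col_names))
print(m)
```

As with the scan() approach above, the result is a numeric matrix with the original column names.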
HTH,
Jeff
On Sun, Dec 16, 2012 at 6:23 AM, <Igor.Drobyshev2 at uqat.ca> wrote:
Dear R experts,

For quite some time I have been trying to solve the mystery of reading a seemingly trouble-free text file. The data is a temperature reconstruction arranged as a huge grid, preceded by seven "header lines" (which you can see better if the file is opened in Firefox or Chrome).

This is the data (gridded temperature reconstruction):
ftp://ftp.ncdc.noaa.gov/pub/data/paleo/historical/europe/casty2007/temp-mon.txt
And this is the original data description:
ftp://ftp.ncdc.noaa.gov/pub/data/paleo/historical/europe/casty2007/readme-casty2007.txt

Basically, it says "space-delimited ASCII format" there ...

I tried this:
Temperature<-read.table(FileName,skip = 7, header = TRUE, na.strings="NA",sep="")
But ..
Temperature <- read.table(FileName, skip = 7, header = FALSE, sep="")
Error in read.table(FileName, skip = 7, header = FALSE, sep = "") :
  empty beginning of file
Trying read.csv gives this:
Error: cannot allocate vector of size 370.5 Mb

I attempted to handle this by opening and resaving the file in other software, but even though I can still see the first lines of the file in the import dialog, reading the full file always ends with an error, possibly because of the huge number of columns.

I believe the problem is with some special encoding, but I cannot figure out how to get around it. Could some of you give me a hint on that?

Many thanks in advance,
Igor

Igor Drobyshev
Dendrochronological laboratory at Station de Recherche FERLD, director
Chaire industrielle CRSNG-UQAT-UQAM en aménagement forestier durable
Université du Québec en Abitibi-Témiscamingue
445 boul. de l'Université
Rouyn-Noranda, QC Canada J9X5E4
http://www.dendro.uqat.ca/