Problem with writing a file in UTF-8
This is asking FAR too much under Windows, which has no UTF-8 locales. In particular, cat() (on which write() is based) will convert to the native locale, even if you manage to input the string as an R UTF-8 string. And conversion is an OS service, so you are getting the conversion Windows sees as appropriate. The best way around this is to use a more capable OS. But you can do e.g.
x <- '\u0171\u0141' # ensure this really is "űŁ"
writeLines(x, 'foo', useBytes=TRUE) # ensure no conversion
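To confirm that no conversion took place, the file can be read back as raw bytes and compared with the UTF-8 encoding of the two characters (c5 b1 for U+0171, c5 81 for U+0141). The following is only a sketch: it writes to a temporary file rather than 'foo', and the variable names are made up for illustration.

```r
# Build the string from escapes so it does not depend on how the
# script file itself was saved; enc2utf8() is a no-op if already UTF-8.
x <- enc2utf8("\u0171\u0141")

# Hypothetical scratch file, used here instead of 'foo'
tmp <- tempfile(fileext = ".txt")
writeLines(x, tmp, useBytes = TRUE)   # write the bytes as they are

# Read the file back in binary, with no re-encoding
bytes <- readBin(tmp, what = "raw", n = file.info(tmp)$size)
print(bytes)

# UTF-8 encoding of U+0171 followed by U+0141
expected <- as.raw(c(0xc5, 0xb1, 0xc5, 0x81))

# Compare only the leading bytes: writeLines() appends a line ending,
# whose exact bytes differ between platforms
identical(bytes[seq_along(expected)], expected)

unlink(tmp)
```

If the comparison is FALSE, the printed bytes show what was actually written, for example the two single bytes for "uL".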
On Mon, 21 Feb 2011, Matt Shotwell wrote:
Thomas,
I wasn't able to reproduce your finding. The last two characters in my
'out.txt' file were just as expected. But, I'm in an UTF-8 locale. Your
locale affects the encoding of characters on your platform. If you're
not in a UTF-8 locale, then characters are converted from your native
encoding to UTF-8 (when you specify encoding="UTF-8"). In the process of
conversion, it's possible to lose information. You can test whether
there is a loss (or a change rather) when R writes these characters like
so:
# what does űŁ look like in binary (hex)?
raw_before <- charToRaw("űŁ")
# write 'out.txt' as before
out <- file(description="out.txt", open="w", encoding="UTF-8")
write(x="űŁ", file=out)
close(con=out)
# read in the two characters
out <- file(description="out.txt", open="r", encoding="UTF-8")
raw_after <- charToRaw(readChar(con=out, nchars=2))
close(con=out)
# compare the raw representations
identical(raw_before, raw_after)
This test passes on my machine. But, there's also the question of
whether these characters made it onto R-help list unaltered. Also,
please include the result of sessionInfo() in your subsequent messages.
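The loss of information described above can also be checked without touching a file, by converting a string to a candidate encoding and back with iconv() and seeing whether it survives. This is a rough sketch, not part of the original test: the helper name survives() is hypothetical, and "latin1" merely stands in for whatever the native encoding happens to be.

```r
# Does a UTF-8 string survive a round trip through another encoding?
# 'survives' is a made-up helper name for this illustration.
survives <- function(x, to) {
  there <- iconv(x, from = "UTF-8", to = to)      # NA if inconvertible
  back  <- iconv(there, from = to, to = "UTF-8")  # convert back again
  !is.na(back) & back == x
}

# U+00E9 (e with acute accent) exists in Latin-1
ok_latin1 <- survives("\u00e9", "latin1")

# U+0171 and U+0141 do not exist in Latin-1
lost_latin1 <- !survives("\u0171\u0141", "latin1")

print(c(ok_latin1 = ok_latin1, lost_latin1 = lost_latin1))

# Whether the current session runs in a UTF-8 locale at all
print(l10n_info()[["UTF-8"]])
```

The round trip is used, rather than a bare check for NA, because some platforms substitute a look-alike character such as "u" for an inconvertible one instead of reporting a failure.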
Best,
Matt
sessionInfo()
R version 2.11.1 (2010-05-31)
i686-pc-linux-gnu

locale:
 [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
 [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8
 [5] LC_MONETARY=C             LC_MESSAGES=en_US.utf8
 [7] LC_PAPER=en_US.utf8       LC_NAME=C
 [9] LC_ADDRESS=C              LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

On Thu, 2011-02-17 at 13:54 -0800, tpklein wrote:
Hello,

I am working with a data frame containing character strings with many special symbols from various European languages. When writing such character strings to a file using the UTF-8 encoding, some of them are converted in a strange way. See the following example, run in R 2.12.1 on Windows 7:

out <- file( description="out.txt", open="w", encoding="UTF-8")
write( x="???????", file=out )
close( con=out )

The last two symbols in the character string are converted to "uL" while all other characters are not changed (which is what I want).

How to explain this? Does it have something to do with my locale? And is there a way to work around this problem?

-- Any help would be greatly appreciated.

Thomas
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595