write.csv converts Åland to <c5>land

10 messages · Dr Eberhard W Lisse, Jinsong Zhao, John Kane +2 more

#
Hi there,

I tried to export country names to a CSV file with write.csv(). 
In the resulting file, Åland was converted to <c5>land. Is there any way 
to prevent this from happening? Thanks!

 > abc
[1] "Åland"
 > write.table(abc, file = "")
"x"
"1" "<c5>land"

Best,
Jinsong
#
?file.write()

look for fileEncoding?

el
On 20/10/2020 11:13, Jinsong Zhao wrote:
#
On 2020/10/20 17:23, Dr Eberhard W Lisse wrote:
There is no file.write(). I have tried fileEncoding = "utf8" and 
"latin1" in write.csv(). However, neither has any effect. The output is 
<U+00C5>land or <c5>land.

Best,
Jinsong
#
Apologies, 

I meant

?write.table()

el
On 20/10/2020 12:38, Jinsong Zhao wrote:
[...]
#
Perhaps

?readr::write_delim()

el
On 20/10/2020 12:45, Dr Eberhard W Lisse wrote:
#
Hi there,

Why is the same string displayed in different forms?

 > abc[,1]
[1] "Åland"       "Afghanistan"
 > abc
          name
1    <c5>land
2 Afghanistan

And more...

 > dput(abc, "aa.txt")
 > dget("aa.txt")
          name
1    <c5>land
2 Afghanistan
 > dget("aa.txt")[,1]
[1] "<c5>land"    "Afghanistan"

Best,
Jinsong
On 2020/10/20 17:13, Jinsong Zhao wrote:
#
It looks like an encoding problem.

It works fine for me with R encoding set to UTF-8.

Here is part of my sessionInfo() results:
[1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8

I would suggest issuing the command
sessionInfo()
and seeing what your encoding is.
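
A quicker check than reading the whole sessionInfo() output is sketched below. It uses only base R (Sys.getlocale() and l10n_info()); the variable names ctype and info are just illustrative.

```r
# Report the character-type locale that R is currently running under.
ctype <- Sys.getlocale("LC_CTYPE")
info <- l10n_info()

cat("LC_CTYPE:", ctype, "\n")
cat("UTF-8 locale:", info[["UTF-8"]], "\n")
cat("Latin-1 locale:", info[["Latin-1"]], "\n")

# The codepage element is only present on Windows.
if (!is.null(info$codepage)) cat("Windows code page:", info$codepage, "\n")
```

If "UTF-8 locale" is FALSE, characters outside the native code page may be substituted on output.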
On Tue, 20 Oct 2020 at 08:22, Jinsong Zhao <jszhao at yeah.net> wrote:
#
You don't say, but I'd guess you're using Windows.  In your code page, 
the character Å is probably not representable.  At some point in the 
sequence of operations involved in printing the dataframe R puts the 
string into the native encoding, and since that's impossible on your 
system, it substitutes the <c5> instead.  The fact that you can 
sometimes display it is because internally R uses UTF-8 as much as it 
can, and it can represent the character.
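
A similar substitution can be reproduced by hand on any platform, as a rough sketch. It assumes the string is stored as latin1 (where Å is the single byte c5) and uses ASCII as a stand-in for a code page that lacks the character; this is not necessarily the exact path R takes when printing a data frame. The names x_utf8, x_latin1 and x_sub are illustrative.

```r
# Build the string from a Unicode escape so the example does not depend
# on the encoding of this script file.
x_utf8 <- "\u00c5land"

# The same text re-encoded as latin1 (assumed here to be how it was stored).
x_latin1 <- iconv(x_utf8, from = "UTF-8", to = "latin1")

# Inspect the declared encodings and the underlying bytes.
Encoding(x_utf8)
Encoding(x_latin1)
charToRaw(x_latin1)

# Convert to a target that cannot represent the character (ASCII stands in
# for the code page); sub = "byte" replaces each unconvertible byte with
# its hex code in angle brackets.
x_sub <- iconv(x_latin1, from = "latin1", to = "ASCII", sub = "byte")
print(x_sub)
```

Here x_sub comes out as "<c5>land", the same form that appears in the output file.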

One fix for this is to switch from Windows to some other OS.  The others 
all have proper support for UTF-8.

You might have luck changing your Windows code page to one that includes 
the Å, but then there'll be some other characters that are missed.

You should definitely investigate Eberhard's advice, and test non-base 
packages like readr.  They are all written much more recently than the 
base functions, and might have proper support for out-of-code-page 
characters.

Duncan Murdoch
On 20/10/2020 8:20 a.m., Jinsong Zhao wrote:
#
Thank you very much for the hint. I tried it on a FreeBSD machine with 
the locale set to en_US.UTF-8, and it works fine.

However, on my Windows machine,
 > Sys.getlocale()
[1] "LC_COLLATE=Chinese (Simplified)_China.936;LC_CTYPE=Chinese 
(Simplified)_China.936;LC_MONETARY=Chinese 
(Simplified)_China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_China.936"

It behaves just as I posted.

BTW, I cannot understand why a string is displayed differently as 
a vector and as a data frame.

Best,
Jinsong
On 2020/10/20 21:56, John Kane wrote:
#
Hi,

One additional option that you might want to look at is to use ?writeLines with 'useBytes = TRUE', where the default is FALSE.

Windows, as Duncan notes, is problematic with extended encodings, and you can actually get conflicted encoding of text, based upon what is used within R, versus the local system encoding set by the OS.

There is an added step of complexity with writeLines(), of having to pre-format the line(s) to be output to conform to CSV required formatting. So you would need to paste() together each output line first using field delimiters, double quotes, etc. prior to output. 

Essentially, mimic the default formatting of write.csv(), on a line by line basis, and then output the resulting object to a text file, with a single call to writeLines().

Regards,

Marc Schwartz
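
A minimal sketch of the writeLines() approach described above, under some assumptions: the data frame abc and the helper quote_field() are hypothetical, every field is quoted (write.csv() leaves numeric columns unquoted), and row names are omitted (write.csv() writes them by default). The strings are converted with enc2utf8() first so that useBytes = TRUE writes UTF-8 bytes unchanged.

```r
# Hypothetical data frame mirroring the one in this thread.
abc <- data.frame(name = c("\u00c5land", "Afghanistan"),
                  stringsAsFactors = FALSE)

# Quote one field CSV-style: double embedded quotes, then wrap in quotes.
quote_field <- function(x) {
  paste0('"', gsub('"', '""', x, fixed = TRUE), '"')
}

# Header line from the column names.
header <- paste(quote_field(enc2utf8(names(abc))), collapse = ",")

# One output line per row: convert each column to UTF-8, quote it, and
# paste the columns together with commas.
cols <- lapply(abc, function(col) quote_field(enc2utf8(as.character(col))))
rows <- do.call(paste, c(unname(cols), sep = ","))

# Write everything in a single call; useBytes = TRUE skips re-encoding
# to the native code page.
out_file <- tempfile(fileext = ".csv")
writeLines(c(header, rows), out_file, useBytes = TRUE)

# Read back as UTF-8 to check the result.
back <- readLines(out_file, encoding = "UTF-8")
print(back)
```

The resulting file is UTF-8, so whatever reads it (a spreadsheet, for instance) needs to be told to use that encoding.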