write.csv converts Åland to <c5>land

10 messages · Dr Eberhard W Lisse, Jinsong Zhao, John Kane +2 more

Jinsong Zhao

Tue, Oct 20, 2020 2:13 AM #

Hi there,

I tried to export country names to a csv file with write.csv(). 
In the resulting file, Åland was converted to <c5>land. Is there any way 
to prevent this? Thanks!

 > abc
[1] "Åland"
 > write.table(abc, file = "")
"x"
"1" "<c5>land"

Best,
Jinsong

Dr Eberhard W Lisse

Tue, Oct 20, 2020 2:23 AM #

?file.write()

look for fileEncoding?

el

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Dr. Eberhard W. Lisse   \         /       Obstetrician & Gynaecologist 
el at lisse.NA             / *      |  Telephone: +264 81 124 6733 (cell)
PO Box 8421 Bachbrecht  \      /  If this email is signed with GPG/PGP
10007, Namibia           ;____/ Sect 20 of Act No. 4 of 2019 may apply

Jinsong Zhao

Tue, Oct 20, 2020 3:38 AM #

On 2020/10/20 17:23, Dr Eberhard W Lisse wrote:

There is no file.write(). I have tried fileEncoding = "utf8" and 
"latin1" in write.csv(). However, it has no effect. The output is 
<U+00C5>land or <c5>land.
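
For reference, the attempt described above probably looked something like the following sketch. The data frame contents and the temporary file are assumptions for illustration. Whether Å survives depends on the platform's native encoding and the R version; in this thread it did not survive on Windows.

```r
# Hypothetical data; the thread only shows "Åland" and "Afghanistan".
abc <- data.frame(name = c("\u00c5land", "Afghanistan"),
                  stringsAsFactors = FALSE)

out <- tempfile(fileext = ".csv")

# Ask write.csv() to re-encode its output as UTF-8 when writing the file.
write.csv(abc, file = out, row.names = FALSE, fileEncoding = "UTF-8")

# Read the file back, declaring the lines as UTF-8, to inspect the result.
lines <- readLines(out, encoding = "UTF-8")
print(lines)
```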

Best,
Jinsong

Dr Eberhard W Lisse

Tue, Oct 20, 2020 3:45 AM #

Apologies, 

I meant

?write.table()

el

Dr Eberhard W Lisse

Tue, Oct 20, 2020 3:48 AM #

Perhaps

?readr::write_delim()

el

Jinsong Zhao

Tue, Oct 20, 2020 5:20 AM #

Hi there,

Why is the same string displayed in different forms?

 > abc[,1]
[1] "Åland"       "Afghanistan"
 > abc
          name
1    <c5>land
2 Afghanistan

And more...

 > dput(abc, "aa.txt")
 > dget("aa.txt")
          name
1    <c5>land
2 Afghanistan
 > dget("aa.txt")[,1]
[1] "<c5>land"    "Afghanistan"

Best,
Jinsong

John Kane

Tue, Oct 20, 2020 6:56 AM #

It looks like an encoding problem.

It works fine for me with R encoding set to UTF-8

Here is part of my sessionInfo() output:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8

I would suggest issuing the command
sessionInfo()
and seeing what your encoding is.
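
Besides sessionInfo(), a few base R calls show the locale and how a given string is encoded. This is only a sketch: the sample string is an assumption, written with a \u escape so that the script itself stays ASCII.

```r
# Locale setting that determines R's native encoding.
print(Sys.getlocale("LC_CTYPE"))

# Reports whether the session is multibyte, UTF-8, or Latin-1
# (and the code page on Windows).
print(l10n_info())

# Sample string (assumed for illustration).
x <- "\u00c5land"

# Declared encoding of the string, and its bytes once forced to UTF-8.
print(Encoding(x))
utf8_bytes <- charToRaw(enc2utf8(x))
print(utf8_bytes)  # c3 85 6c 61 6e 64: the A-ring takes two bytes in UTF-8
```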

John Kane
Kingston ON Canada

Duncan Murdoch

Tue, Oct 20, 2020 7:28 AM #

You don't say, but I'd guess you're using Windows.  In your code page, 
the character Å is probably not representable.  At some point in the 
sequence of operations involved in printing the dataframe R puts the 
string into the native encoding, and since that's impossible on your 
system, it substitutes the <c5> instead.  The fact that you can 
sometimes display it is because internally R uses UTF-8 as much as it 
can, and it can represent the character.

One fix for this is to switch from Windows to some other OS.  The others 
all have proper support for UTF-8.

You might have luck changing your Windows code page to one that includes 
the Å, but then there'll be some other characters that are missed.

You should definitely investigate Eberhard's advice, and test non-base 
packages like readr.  They are all written much more recently than the 
base functions, and might have proper support for out-of-code-page 
characters.
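
The kind of substitution described above can be imitated with iconv(). This is a rough sketch, not a trace of what R's printing code does internally: ASCII stands in for a code page that lacks the character, and the Latin-1 starting point is an assumption.

```r
x_utf8 <- enc2utf8("\u00c5land")

# The same text in Latin-1: the A-ring becomes the single byte 0xc5.
x_latin1 <- iconv(x_utf8, from = "UTF-8", to = "latin1")
latin1_bytes <- charToRaw(x_latin1)
print(latin1_bytes)  # c5 6c 61 6e 64

# ASCII is a stand-in for a code page without the character (assumption).
# sub = "byte" asks iconv() to replace each unconvertible byte with its
# hex code in angle brackets instead of returning NA.
escaped <- iconv(x_latin1, from = "latin1", to = "ASCII", sub = "byte")
print(escaped)
```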

Duncan Murdoch

Jinsong Zhao

Tue, Oct 20, 2020 7:31 AM #

Thank you very much for the hint. I tried it on a FreeBSD machine with 
locale set to en_US.UTF-8, and it works fine.

However, on my Windows machine,
 > Sys.getlocale()
[1] "LC_COLLATE=Chinese (Simplified)_China.936;LC_CTYPE=Chinese 
(Simplified)_China.936;LC_MONETARY=Chinese 
(Simplified)_China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_China.936"

It behaves just as I posted.

BTW, I cannot understand why the same string is displayed differently 
as a vector and as a data frame.

Best,
Jinsong

Marc Schwartz

Tue, Oct 20, 2020 8:03 AM #

Hi,

One additional option that you might want to look at is to use ?writeLines with 'useBytes = TRUE', where the default is FALSE.

Windows, as Duncan notes, is problematic with extended encodings, and you can actually end up with conflicting encodings of text: the one used within R versus the local system encoding set by the OS.

There is an added step of complexity with writeLines(), of having to pre-format the line(s) to be output to conform to CSV required formatting. So you would need to paste() together each output line first using field delimiters, double quotes, etc. prior to output. 

Essentially, mimic the default formatting of write.csv(), on a line by line basis, and then output the resulting object to a text file, with a single call to writeLines().
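
A minimal sketch of that approach follows. The data frame, the helper quote_field(), and the temporary file are assumptions for illustration. Unlike write.csv(), the sketch omits the row-name column and makes no special provision for NA values.

```r
# Hypothetical data standing in for the poster's data frame.
abc <- data.frame(name = c("\u00c5land", "Afghanistan"),
                  stringsAsFactors = FALSE)

# Quote a field the way CSV expects: wrap it in double quotes and
# double any embedded double quotes. (Hypothetical helper.)
quote_field <- function(x) {
  paste0('"', gsub('"', '""', x, fixed = TRUE), '"')
}

# Header line, then one line per row, with every field forced to UTF-8.
header <- paste(quote_field(names(abc)), collapse = ",")
cols <- lapply(abc, function(col) quote_field(enc2utf8(as.character(col))))
rows <- do.call(paste, c(unname(cols), sep = ","))

# useBytes = TRUE writes the strings byte for byte, skipping re-encoding
# into the native encoding.
out <- tempfile(fileext = ".csv")
writeLines(c(header, rows), con = out, useBytes = TRUE)

# Inspect the raw bytes of the file to see what was actually written.
file_bytes <- readBin(out, what = "raw", n = file.size(out))
print(file_bytes)
```

If the A-ring was written as UTF-8, the byte pair c3 85 appears in the printed bytes.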

Regards,

Marc Schwartz
