NEWS item for bugfix in normalizePath and file.exists?

Hi Toby,

A defensive, portable approach would be to use only file names that
POSIX regards as portable, that is, names made up of ASCII letters,
digits, underscore, dot, and hyphen (with the hyphen not as the first
character). That would work on all systems, and it is what I would use.
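
As a rough illustration of that rule, a check could look like the sketch below. The helper name is_portable_name is hypothetical (it is not part of R), and the characters are listed explicitly rather than matched with regex ranges such as [a-z], since those can depend on the locale.

```r
# Hypothetical helper, not part of base R: TRUE for names that use only
# the POSIX portable file name characters and do not start with a hyphen.
portable_chars <- c(LETTERS, letters, 0:9, "_", ".", "-")

is_portable_name <- function(x) {
  vapply(x, function(name) {
    # Split the name into single characters.
    chars <- strsplit(name, "", fixed = TRUE)[[1]]
    length(chars) > 0 &&
      all(chars %in% portable_chars) &&
      chars[1] != "-"
  }, logical(1), USE.NAMES = FALSE)
}

is_portable_name(c("data_1.csv", "-x", "caf\u00e9.txt"))
```

Only the first of the three names passes: the second starts with a hyphen and the third contains a non-ASCII character.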

Individual operating systems and file systems and their configurations
differ in which additional characters they support and how. On some,
file names are just sequences of bytes; on others, they have to be valid
strings in a certain encoding (and then with certain exceptions).

On Windows, file names are encoded in UTF-16LE at the lowest level (with
unpaired surrogates admitted for historical reasons). R stores strings
in other encodings (UTF-8, native, Latin-1), so file names have to be
translated to/from UTF-16LE, either directly by R or by Windows.
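
To make the translation step concrete, here is a small sketch using base R's iconv(). It only illustrates the encodings involved and is not what R does internally for file names; the example string is made up.

```r
# A string with one non-ASCII character; the \u escape makes R mark it
# as UTF-8.
x <- "caf\u00e9"
Encoding(x)

# The same string as UTF-16LE bytes: two bytes per character here,
# low byte first, i.e. 63 00 61 00 66 00 e9 00.
utf16 <- iconv(x, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
print(utf16)

# Plain ASCII cannot represent the string, so the conversion gives NA.
ascii <- iconv(x, from = "UTF-8", to = "ASCII")
is.na(ascii)
```

A conversion is only possible when the source encoding is known, which is the difficulty with non-ASCII bytes in a "C" locale.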

But there is no way to convert (non-ASCII) strings in "C" encoding to
UTF-16LE, so the examples cannot be made to work on Windows.

When the translation is left to Windows, it assumes the non-UTF-16LE
strings are in the Active Code Page encoding (shown as "system encoding"
in sessionInfo() in R, Latin-1 in your example) instead of the current C
library encoding ("C" in your example). So, file names coming from
Windows will be either the bytes of their UTF-16LE representation or the
bytes of their Latin-1 representation, but which one depends on
implementation details, so the result is really unusable.

I would say using "C" as encoding in R is not a good idea, and 
particularly not on Windows.
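
For what it is worth, a session can check for this situation up front. The sketch below uses only base R calls; the helper name in_c_locale is hypothetical, and the fields that l10n_info() reports vary by platform and R version, so treat the output as indicative.

```r
# Hypothetical helper, not part of base R: TRUE when the character
# type locale of the session is "C" (or its alias "POSIX").
in_c_locale <- function() {
  Sys.getlocale("LC_CTYPE") %in% c("C", "POSIX")
}

# What R knows about the current locale's encoding. On Windows the
# result also carries code page information.
str(l10n_info())

if (in_c_locale()) {
  warning("Session runs in the \"C\" locale; non-ASCII file names ",
          "cannot be handled reliably.")
}
```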

I would say that what happens with such file names in "C" encoding is 
unspecified behavior, which is subject to change at any time without 
notice, and that both the R 4.0.5 and R-devel behavior you are observing 
are acceptable. I don't think it should be mentioned in the NEWS. 
Personally, I would prefer stricter checks of string validity and
perhaps disallowing the "C" encoding in R, that is, yet another behavior,
one that would make it clearer that this cannot really work, but that
would require more thought and effort.

Best
Tomas
On 4/27/21 9:53 PM, Toby Hocking wrote: