NEWS item for bugfix in normalizePath and file.exists?
Hi Toby,
On 4/28/21 4:21 PM, Toby Hocking wrote:
Hi Tomas, thanks for the thoughtful reply. That makes sense about the problems with the C locale on Windows. Actually, I did not choose to use the C locale; it was invoked automatically during a package check.
I see. As long as the tests only have ASCII strings, the encoding does not matter, but once there are other characters, I think we should be running with a real encoding, one in which those characters can be represented. Best, Tomas
To be clear, I do NOT have a file with that name, but I do want file.exists to return a reasonable value, FALSE (with no error). If that behavior is unspecified, then should I use something like tryCatch(file.exists(x), error=function(e)FALSE) instead of assuming that file.exists will always return a logical vector without error? For my particular application that work-around should probably be sufficient, but one may imagine a situation where you want to do
x <- "\360\237\247\222\n| \360\237\247\222\360\237\217\273\n| \360\237\247\222\360\237\217\274\n| \360\237\247\222\360\237\217\275\n| \360\237\247\222\360\237\217\276\n| \360\237\247\222\360\237\217\277\n"
Encoding(x) <- "unknown"
Sys.setlocale(locale="C")
f <- tempfile()
cat("", file = f)
two <- c(x, f)
file.exists(two)
and in that case the correct response from R, in my opinion, would be
c(FALSE, TRUE) -- not an error.
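One way to get that element-wise behavior today, without relying on what file.exists does with unconvertible names, is to wrap each element separately. This is only a sketch: safe_file_exists is a hypothetical helper name, not part of base R, and the "no-such-file" path is made up for illustration. Wrapping the whole call in a single tryCatch would instead collapse the result to one FALSE for all inputs.

```r
# Hypothetical helper (not part of base R): check each path on its own so
# that one unconvertible name does not abort the whole call.
safe_file_exists <- function(paths) {
  vapply(
    paths,
    function(p) tryCatch(file.exists(p), error = function(e) FALSE),
    logical(1),
    USE.NAMES = FALSE
  )
}

# Usage: one path that does not exist and one that does.
f <- tempfile()
cat("", file = f)
safe_file_exists(c(file.path(tempdir(), "no-such-file"), f))
```

The cost is one file.exists call per element, which should not matter for the short vectors typical of package tests.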
Toby
On Wed, Apr 28, 2021 at 3:10 AM Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
Hi Toby,
A defensive, portable approach would be to use only file names regarded as portable by POSIX, that is, names made of ASCII letters, digits, underscore, dot, and hyphen (but a hyphen should not be the first character). That would always work on all systems and this is what I would use.
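A check along these lines could be sketched as follows. The helper name is_portable_name is hypothetical, and it looks only at the last path component; perl = TRUE is used on the assumption that PCRE character ranges are not affected by the locale's collation order.

```r
# Hypothetical helper: TRUE when the last path component uses only ASCII
# letters, digits, underscore, dot and hyphen, and does not start with a
# hyphen.
is_portable_name <- function(path) {
  grepl("^[A-Za-z0-9._][A-Za-z0-9._-]*$", basename(path), perl = TRUE)
}

# Usage: a portable name, a leading hyphen, and a non-ASCII name.
is_portable_name(c("report_2021.txt", "-draft.txt", "r\u00e9sum\u00e9.txt"))
```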
Individual operating systems and file systems and their configurations differ in which additional characters they support and how. On some, file names are just sequences of bytes; on others, they have to be valid strings in a certain encoding (and then with certain exceptions).
On Windows, file names are at the lowest level in UTF-16LE encoding (admitting unpaired surrogates for historical reasons). R stores strings in other encodings (UTF-8, native, Latin-1), so file names have to be translated to/from UTF-16LE, either directly by R or by Windows.
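To make the translation step concrete, here is a small sketch of converting a name from UTF-8 to UTF-16LE bytes with iconv. It assumes the string is valid UTF-8 and that the iconv implementation in use recognizes the name "UTF-16LE"; the file name is made up for illustration, and this is not a description of what R does internally.

```r
# A name containing one non-ASCII character; "\u" escapes give a string
# that R marks as UTF-8.
name <- "caf\u00e9.txt"

# Convert to UTF-16LE and keep the result as raw bytes, since the output
# contains embedded zero bytes and cannot be held in an R string.
utf16 <- iconv(name, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]

length(utf16)   # two bytes per character for these characters
utf16[7:8]      # the bytes encoding the accented letter
```

The conversion only works because the source encoding is known; bytes with no declared encoding give iconv nothing to translate from.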
But there is no way to convert (non-ASCII) strings in "C" encoding to UTF-16LE, so the examples cannot be made to work on Windows.
When the translation is left to Windows, it assumes the non-UTF-16LE strings are in the Active Code Page encoding (shown as "system code page" in sessionInfo() in R, Latin-1 in your example) instead of the current C library encoding ("C" in your example). So, file names coming from Windows will be either the bytes of their UTF-16LE representation or the bytes of their Latin-1 representation, but which one depends on implementation details, so the result is really unusable.
I would say using "C" as encoding in R is not a good idea, and particularly not on Windows.
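A test suite could guard against this combination before touching the file system. The sketch below is an assumption about how one might do that, not an existing R facility: needs_real_encoding is a hypothetical helper, and the ctype argument exists only so the check can be tried without changing the session locale.

```r
# Hypothetical helper: flag names that cannot be interpreted reliably
# because the locale is "C" (or "POSIX") and the name has non-ASCII bytes.
needs_real_encoding <- function(paths, ctype = Sys.getlocale("LC_CTYPE")) {
  in_c_locale <- ctype %in% c("C", "POSIX")
  non_ascii <- vapply(
    paths,
    function(p) any(as.integer(charToRaw(p)) > 127L),
    logical(1),
    USE.NAMES = FALSE
  )
  in_c_locale & non_ascii
}

# Usage: pretend the session is in the "C" locale.
needs_real_encoding(c("plain.txt", "caf\u00e9.txt"), ctype = "C")
```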
I would say that what happens with such file names in "C" encoding is unspecified behavior, which is subject to change at any time without notice, and that both the R 4.0.5 and R-devel behavior you are observing are acceptable. I don't think it should be mentioned in the NEWS.
Personally, I would prefer some stricter checks of string validity and perhaps disallowing the "C" encoding in R, so yet another behavior where it would be clearer that this cannot really work, but that would require more thought and effort.
Best
Tomas
On 4/27/21 9:53 PM, Toby Hocking wrote:
> Hi all, Today I noticed bug(s?) in R-4.0.5, which seem to be fixed in
> R-devel already. I checked on
> https://developer.r-project.org/blosxom.cgi/R-devel/NEWS and there is no
> mention of these changes, so I'm wondering if they are intentional? If so,
> could someone please add a mention of the bugfix in the NEWS?
>
> The problem involves file.exists, on Windows, when a long/strange input
> file name Encoding is unknown, in C locale. I expected that FALSE should be
> returned (and it is on R-devel), but I got an error in R-4.0.5. Code to
> reproduce is:
>
> x <- "\360\237\247\222\n| \360\237\247\222\360\237\217\273\n| \360\237\247\222\360\237\217\274\n| \360\237\247\222\360\237\217\275\n| \360\237\247\222\360\237\217\276\n| \360\237\247\222\360\237\217\277\n"
> Encoding(x) <- "unknown"
> Sys.setlocale(locale="C")
> sessionInfo()
> file.exists(x)
>
> Output I got from R-4.0.5 was
>
>> sessionInfo()
> R version 4.0.5 (2021-03-31)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 19042)
>
> Matrix products: default
>
> locale:
> [1] C
> system code page: 1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> loaded via a namespace (and not attached):
> [1] compiler_4.0.5
>> file.exists(x)
> Error in file.exists(x) : file name conversion problem -- name too long?
> Execution halted
>
> Output I got from R-devel was
>
>> sessionInfo()
> R Under development (unstable) (2021-04-26 r80229)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 19042)
>
> Matrix products: default
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> loaded via a namespace (and not attached):
> [1] compiler_4.2.0
>> file.exists(x)
> [1] FALSE
>
> I also observed similar results when using normalizePath instead of
> file.exists (error in R-4.0.5, no error in R-devel).
>
>> normalizePath(x) #R-4.0.5
> Error in path.expand(path) : unable to translate 'p'
> | p'p;
> | p'p<
> | p'p=
> | p'p>
> | p'p<bf>
> ' to UTF-8
> Calls: normalizePath -> path.expand
> Execution halted
>
>> normalizePath(x) #R-devel
> [1] "C:\\Users\\th798\\R\\\360\237\247\222\n| \360\237\247\222\360\237\217\273\n| \360\237\247\222\360\237\217\274\n| \360\237\247\222\360\237\217\275\n| \360\237\247\222\360\237\217\276\n| \360\237\247\222\360\237\217\277\n"
> Warning message:
> In normalizePath(path.expand(path), winslash, mustWork) :
>   path[1]="?
> | ??
> | ??
> | ??
> | ??
> | ??
> ": The filename, directory name, or volume label syntax is incorrect
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel