
Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' if an environment variable contains \xFF

On 1/31/23 01:27, Simon Urbanek wrote:
Well, yes, dropping such variables is a hack/work-around, but I would
not rule that out from consideration. It would be good to have a look
at what other language runtimes do, but I can understand that dropping
them from getenv() as a work-around, which allows such environment
variables to exist and be inherited by child processes without breaking
what happens in R (the language runtime at hand), makes some sense.

The times when we could have (almost) arbitrary bytes in blobs are over:
one has to choose and separate the two. Going back to allowing blobs in
multi-byte strings without special treatment is not possible. That is
common practice, not just in R; we can't just "stop checking" as
before.

R has also the "bytes" encoding which can be used to work with 
non-strings (encoding agnostic operations on non-ASCII data), and there 
have been a number of improvements that are now ready in R-devel. In
theory, of course, getenv() could also allow returning the results as
"bytes" for those who insist they want to handle environment variables
containing invalid strings. That would be technically possible, but
could not be made the default: people would have to opt in to it via an
argument to the function. I am not sure it is worth it, and it would not
do what Henrik is asking for, but it is possible and would be a
conceptually sound solution allowing people to use such variables (with
newly written code for it that cannot just use the values as "normal"
strings), or to save the whole environment profile.
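
As a rough illustration of the "bytes" encoding mentioned above, here is a
minimal sketch using only base R. The value is made up for the example, and
the opt-in argument to getenv() discussed above is hypothetical and is not
shown:

```r
# Build a value containing the byte 0xFF, which never occurs in valid UTF-8.
x <- rawToChar(as.raw(c(0x61, 0xff, 0x62)))   # bytes: 'a', 0xFF, 'b'

# Check whether the bytes form a valid UTF-8 string (they do not).
is_valid <- validUTF8(x)

# Mark the value as "bytes": R then treats it as an opaque byte sequence
# rather than as text in some character encoding.
Encoding(x) <- "bytes"

# Encoding-agnostic operations still work on it.
n_bytes   <- nchar(x, type = "bytes")
raw_bytes <- charToRaw(x)
```

Code receiving such a value has to be written for it: operations that need
to interpret the content as characters are not expected to accept it.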

Some of us have spent a very long time debating how to go further in
dealing with invalid strings, and it will still take time to decide and
reach a consensus. It is a very hard problem with serious consequences
and limitations. In principle, one possible solution is to ban invalid
strings completely and not allow them to be created. Another
technically valid position is to allow invalid sequences in UTF-8
strings and support a subset of string operations on them, while
throwing errors in others (probably only after/when/if R can rely on
UTF-8 always being the native encoding). Another approach discussed was
what some of the regular expression functions do, where invalid strings
automatically become "bytes"; to some extent this is the current
behavior of some regex operations. The improvements to "bytes" encoding
handling so far in R-devel are a consequence of what we have already
agreed on.
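
To make the byte-wise regex behavior concrete, here is a small sketch that
requests it explicitly through the existing useBytes argument of base R
(this is an explicit opt-in, not the automatic fallback described above).
The entry "BOOM=\xff" is a made-up example value:

```r
# A "NAME=value" entry whose value is the single invalid byte 0xFF.
x <- rawToChar(c(charToRaw("BOOM="), as.raw(0xff)))

# Fixed-pattern search done byte-wise: the input is never decoded as
# characters, so its validity in the native encoding does not matter.
pos <- regexpr("=", x, fixed = TRUE, useBytes = TRUE)

# The invalid byte itself can also be searched for byte-wise.
has_ff <- grepl(rawToChar(as.raw(0xff)), x, fixed = TRUE, useBytes = TRUE)
```

With useBytes = TRUE the reported match position is a byte offset, not a
character offset.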

So, it is not hard today to find inconsistencies in R in that some 
functions check for string validity while others don't. It is simply 
because we have not yet gone all the way to a solution to this big 
problem. getenv() is just one of the many.
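
The kind of inconsistency meant here can be probed with a sketch like the
following. fails() is a hypothetical helper written for this example, and
which calls raise an error depends on the R version and on the locale (the
character-level checks are expected to trigger in a UTF-8 locale), so no
results are asserted for those:

```r
# An invalid string: 'a', 0xFF, 'b'.
x <- rawToChar(as.raw(c(0x61, 0xff, 0x62)))

# Hypothetical helper: TRUE if evaluating the expression raises an error.
fails <- function(expr) {
  inherits(tryCatch(expr, error = function(e) e), "error")
}

# Operations that only look at bytes do not need a valid string.
bytes_fail <- fails(nchar(x, type = "bytes"))
paste_fail <- fails(paste0(x, "!"))

# Operations that must interpret characters may reject the same input.
chars_fail  <- fails(nchar(x, type = "chars"))
substr_fail <- fails(substring(x, 2L))
```

Comparing the four flags on a given installation shows which of these
functions check validity there.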

In the case of getenv(), indeed, it is because of how it is implemented:
it wraps OS functions, and the OSes don't require string validity. Btw,
Windows has both a single-byte and a multi-byte variant of the
environment profile, which may disagree. R is not the source of this
problem; it is the operating systems that couldn't (yet?) reach a
decision and where the ambiguity remains.
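
To illustrate why validity is not guaranteed at this level: an environment
entry as delivered by the OS is just a "NAME=value" byte sequence. The
sketch below splits such an entry on bytes. split_env_entry() is a
hypothetical helper made up for this example; it is not part of R and is
not how Sys.getenv() is implemented:

```r
# Hypothetical helper: split one "NAME=value" entry at the first '=' byte,
# without ever interpreting the content as characters.
split_env_entry <- function(entry) {
  bytes <- charToRaw(entry)
  eq <- which(bytes == charToRaw("="))[1L]
  if (is.na(eq)) stop("entry has no '=' separator")
  name  <- rawToChar(bytes[seq_len(eq - 1L)])
  value <- rawToChar(bytes[-seq_len(eq)])
  # Mark the value as "bytes" so later code cannot mistake it for text.
  Encoding(value) <- "bytes"
  list(name = name, value = value)
}

# Made-up entry: variable BOOM holding the single byte 0xFF.
entry <- rawToChar(c(charToRaw("BOOM="), as.raw(0xff)))
res <- split_env_entry(entry)
```

Here the name comes back as an ordinary string and the value as a
"bytes"-marked one, which is the separation of strings and blobs argued for
above.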

In practice, so far, the problem is small, because [1] (reported by 
Henrik) is behavior due to an obvious design bug in other software, and 
luckily this is rare.
The key (and common) design decision behind this is that environment
variables are strings.

Tomas
