source(echo = TRUE) with an iso-8859-1 encoded file gives an error

3 messages · Ista Zahn, Scott Kostyshak

#
I have very little knowledge about file encodings and would like to
learn more.

I've read the following pages to learn more:

  http://stat.ethz.ch/R-manual/R-devel/library/base/html/Encoding.html
  https://stackoverflow.com/questions/4806823/how-to-detect-the-right-encoding-for-read-csv
  https://developer.r-project.org/Encodings_and_R.html

The last one, in particular, has been very helpful. I would be
interested in any further references that you suggest.

I attach a file that reproduces the issue I would like to learn more
about. I do not know if the file encoding will be correctly preserved
through email, so I also provide the file (temporarily) on Dropbox here:

  https://www.dropbox.com/s/3lbgebk7b5uaia7/encoding_export_issue.R?dl=0

The file gives an error when using "source()" with the
argument echo = TRUE:

  > source("encoding_export_issue.R", echo = TRUE)
  Error in nchar(dep, "c") : invalid multibyte string, element 1
  In addition: Warning message:
  In grepl("^[[:blank:]]*$", dep[1L]) :
    input string 1 is invalid in this locale

The problem comes from the non-ASCII character in the .R file (rendered
here as "?"). The file appears to be encoded as "iso-8859-1":

  $ file --mime-encoding encoding_export_issue.R 
  encoding_export_issue.R: iso-8859-1
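
A similar check can be done from within R. The following is only a minimal
sketch and not part of the original report: the helper name
is_valid_utf8_file is hypothetical, and the temporary files are stand-ins
for encoding_export_issue.R, assumed to contain the single non-ASCII byte
0xE9. The helper reads the file as raw bytes and asks validUTF8() whether
they form valid UTF-8. It can rule UTF-8 out, but it cannot by itself
confirm that a file is iso-8859-1.

```r
# Hypothetical helper: TRUE if the file's bytes are valid UTF-8.
# Assumes a text file without embedded NUL bytes (rawToChar() would fail).
is_valid_utf8_file <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  validUTF8(rawToChar(bytes))
}

# Stand-in file: an assignment whose string ends in the byte 0xE9,
# which is e-acute in iso-8859-1.
latin1_file <- tempfile(fileext = ".R")
writeBin(c(charToRaw('x <- "caf'), as.raw(0xe9), charToRaw('"\n')),
         latin1_file)

# The same code with the character encoded as UTF-8 (bytes 0xC3 0xA9).
utf8_file <- tempfile(fileext = ".R")
writeBin(c(charToRaw('x <- "caf'), as.raw(c(0xc3, 0xa9)), charToRaw('"\n')),
         utf8_file)

latin1_ok <- is_valid_utf8_file(latin1_file)  # FALSE
utf8_ok   <- is_valid_utf8_file(utf8_file)    # TRUE
```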

Note that for me:

  > getOption("encoding")
  [1] "native.enc"

so "native.enc" is used for the "encoding" argument of source().
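
For context, "native.enc" refers to the encoding of the running session's
locale. A small sketch (not from the original message) of how one might
inspect what that is; the output depends on the system, so none is shown.

```r
# Is the current session running in a UTF-8 locale?
info <- l10n_info()
is_utf8_session <- info[["UTF-8"]]

# Character sets suggested by the locale name (may contain NA).
charsets <- utils::localeToCharset()

print(getOption("encoding"))
print(is_utf8_session)
print(charsets)
```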

The following two calls succeed:

  > source("encoding_export_issue.R", echo = TRUE, encoding = "unknown")
  > source("encoding_export_issue.R", echo = TRUE, encoding = "iso-8859-1")

Is this file a valid "iso-8859-1" encoded file?  Why does source() fail
in the case of encoding set to "native.enc"? Is it because of the
settings to UTF-8 in my locale (see info on my system at the bottom of
this email)?

I'm guessing it would be a bad idea to put

  options(encoding = "unknown")

in my .Rprofile, because it is difficult to always correctly guess the
encoding of files? Is there a reason why setting it to "unknown" would
lead to more problems than leaving it set to "native.enc"?

I've reproduced the above behavior on R-devel (r74677) and 3.4.3. Below
is my session info and locale info for my system with the 3.4.3 version:

R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.4.3

[1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"

Thanks for your time,

Scott

P.S. Note that I had posted this question to r-devel, which was the
incorrect choice. For archival purposes, I reference the thread here:

https://www.mail-archive.com/search?l=mid&q=20180501185750.445oub53vcdnyyyx%40steph
#
On Fri, May 4, 2018 at 4:47 PM, Scott Kostyshak <skostyshak at ufl.edu> wrote:
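> Is this file a valid "iso-8859-1" encoded file?

The one you attached is not. The one linked to on Dropbox is.

> Why does source() fail
> in the case of encoding set to "native.enc"? Is it because of the
> settings to UTF-8 in my locale [...]

Yes.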
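
To illustrate that answer with a small sketch (the string below is an
assumed example, not the contents of the attached file): in iso-8859-1 a
non-ASCII character is a single byte such as 0xE9, which on its own is not
a valid UTF-8 sequence, so a session that expects UTF-8 input rejects it.
Re-encoding with iconv() yields the two-byte UTF-8 form.

```r
# "cafe" with an e-acute, as iso-8859-1 bytes: the accented letter is the
# single byte 0xE9.
latin1_bytes <- as.raw(c(0x63, 0x61, 0x66, 0xe9))
latin1_string <- rawToChar(latin1_bytes)

# Interpreted as UTF-8, the lone 0xE9 byte is an invalid multibyte sequence.
valid_before <- validUTF8(latin1_string)   # FALSE

# Declare the actual source encoding and convert to UTF-8.
utf8_string <- iconv(latin1_string, from = "ISO-8859-1", to = "UTF-8")
valid_after <- validUTF8(utf8_string)      # TRUE

# In UTF-8 the same character takes two bytes: 0xC3 0xA9.
utf8_bytes <- charToRaw(utf8_string)
print(utf8_bytes)
```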
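
> I'm guessing it would be a bad idea to put
>
>   options(encoding = "unknown")
>
> in my .Rprofile, because it is difficult to always correctly guess the
> encoding of files?

My guess is that the issue is less about the difficulty of guessing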
the encoding, and more about the time it takes to do so. That's not
particularly relevant for the "source" function, but the encoding
option is used by many of the file IO functions in R and so has
implications well beyond the behavior of "source".

> Is there a reason why setting it to "unknown" would
> lead to more problems than leaving it set to "native.enc"?

It depends on what you are actually doing. If you are on a UTF-8
locale and working exclusively with UTF-8 files, setting
options(encoding = "unknown") will just slow down your file IO by
checking for the encoding every time.
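
One way to confine the guessing to the files that need it is to pass the
encoding per call instead of changing the global option. This is a minimal
sketch under stated assumptions: the wrapper name source_guess and the
temporary script are hypothetical, and the example script is plain ASCII
so that it reads the same in any locale.

```r
# Hypothetical wrapper: ask source() to guess the encoding for this call
# only, leaving getOption("encoding") untouched for all other file IO.
source_guess <- function(file, ...) {
  source(file, encoding = "unknown", ...)
}

# Stand-in script (ASCII only) written to a temporary file.
script <- tempfile(fileext = ".R")
writeLines("answer <- 6 * 7", script)

# Evaluate in a separate environment to keep the workspace clean.
env <- new.env()
option_before <- getOption("encoding")
source_guess(script, local = env)
option_after <- getOption("encoding")

print(env$answer)
```

Because the option is never modified, other file IO functions keep using
whatever getOption("encoding") already was.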
#
On Fri, May 04, 2018 at 10:58:26PM +0000, Ista Zahn wrote:
Ah I did not think about this possibility. Makes sense.
Good to know. Thank you for your response, Ista.

Scott