ASCIIfy() - a proposal for package:tools

5 messages · Arni Magnusson, Gregory R. Warnes, Duncan Murdoch

#
Hi all,

I would like to propose the attached function ASCIIfy() to be added to the 
'tools' package.

Non-ASCII characters in character vectors can be problematic for R 
packages, but sometimes they cannot be avoided. To make packages portable 
and build without 'R CMD check' warnings, my solution has been to convert 
problematic characters in functions and datasets to escaped ASCII, so 
plot(1,main="São Paulo") becomes plot(1,main="S\u00e3o Paulo").
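
As a small illustration of the escape described above (a sketch, not part of the original message): the R parser resolves \u00e3 when it reads the source, so the literal is plain ASCII in the file but still denotes the same nine-character string. The variable name `city` is only for illustration.

```r
# The source line below is pure ASCII; the parser turns \u00e3 into the
# character itself when the file is read.
city <- "S\u00e3o Paulo"

nchar(city)                               # 9 characters, not 14
utf8ToInt(enc2utf8(substr(city, 2, 2)))   # 227, which is 0xe3
```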

The showNonASCII() function in package:tools is helpful to identify R 
source files where characters should be converted to ASCII one way or 
another, but I could not find a function to actually perform the 
conversion to ASCII.
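
For readers unfamiliar with showNonASCII(), a minimal sketch of the detection step follows. The vector `src` and the helper `has_non_ascii` are hypothetical names used only here; the second half is a plain base R check added for illustration, not something taken from package:tools.

```r
library(tools)

# Two lines of R source held as strings; only the second has a non-ASCII character
src <- c("plot(1, main = \"Reykjavik\")",
         "plot(1, main = \"S\u00e3o Paulo\")")

# Reports the offending elements, with non-ASCII bytes shown as <xx> escapes
showNonASCII(src)

# Hypothetical helper: TRUE for strings containing any code point above 127
has_non_ascii <- function(s) any(utf8ToInt(enc2utf8(s)) > 127L)
flagged <- vapply(src, has_non_ascii, logical(1), USE.NAMES = FALSE)
which(flagged)
```

This only locates the problem lines; converting them is the step the proposed function is meant to cover.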

I have written the function ASCIIfy() to convert character vectors to 
ASCII. I imagine other R package developers might be looking for a similar 
tool, and it seems to me that package:tools is the first place they would 
look, where the R Core Team has provided a variety of tools for handling 
non-ASCII characters in package development.

I hope the R Core Team will adopt ASCIIfy() into the 'tools' package, to 
make life easier for package developers outside the English-speaking 
world. Of course, I have no problem with them renaming or rewriting the 
function in any way.

See the attached examples - all in flat ASCII that was prepared using the 
function itself! The main objective, though, is to ASCIIfy functions and 
datasets, not help pages.

Arni
-------------- next part --------------
ASCIIfy <- function(x, bytes=2, fallback="?")
{
  # 'bytes' selects the escape style: 1 gives \x00 codes, 2 gives \u0000 codes
  bytes <- match.arg(as.character(bytes), 1:2)
  convert <- function(char)  # convert to ASCII, e.g. "z", "\xfe", or "\u00fe"
  {
    raw <- charToRaw(char)
    if(length(raw)==1 && raw<=127)  # 7-bit
      ascii <- char
    else if(length(raw)==1 && bytes==1)  # 8-bit to \x00
      ascii <- paste0("\\x", raw)
    else if(length(raw)==1 && bytes==2)  # 8-bit to \u0000
      ascii <- paste0("\\u", chartr(" ","0",formatC(as.character(raw),width=4)))
    else if(length(raw)==2 && bytes==1)  # 16-bit to \x00, if possible
    {
      if(utf8ToInt(char) <= 255)
        ascii <- paste0("\\x", format.hexmode(utf8ToInt(char)))
      else
      {
        ascii <- fallback
        warning(char, " could not be converted to 1 byte")
      }
    }
    else if(length(raw)==2 && bytes==2)  # UTF-8 to \u0000
      ascii <- paste0("\\u", format.hexmode(utf8ToInt(char),width=4))
    else  # anything longer than two bytes cannot be represented
    {
      ascii <- fallback
      warning(char, " could not be converted to ", bytes, " byte")
    }
    return(ascii)
  }

  if(length(x) > 1)  # vectors: process one element at a time
  {
    sapply(x, ASCIIfy, bytes=bytes, fallback=fallback, USE.NAMES=FALSE)
  }
  else
  {
    input <- unlist(strsplit(x,""))       # "c"  "a"  "f"  "<\'e>"
    output <- character(length(input))    # ""   ""   ""   ""
    for(i in seq_along(input))
      output[i] <- convert(input[i])      # "c"  "a"  "f"  "\\u00e9"
    output <- paste(output, collapse="")  # "caf\\u00e9"
    return(output)
  }
}
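
The per-character logic above can also be sketched in terms of code points rather than raw bytes. This is only an illustrative variant under stated assumptions, not the proposed function: `escape_non_ascii` is a hypothetical name, it always emits \u0000-style codes, and it escapes any code point up to 0xFFFF, whereas the function above, as written, falls back for characters that occupy more than two bytes.

```r
# Hypothetical helper for illustration only (not the proposed ASCIIfy):
# escape every code point above 127 as \uXXXX, using 'fallback' for
# code points that do not fit in four hex digits.
escape_non_ascii <- function(x, fallback = "?") {
  one <- function(s) {
    if (is.na(s)) return(NA_character_)
    cp <- utf8ToInt(enc2utf8(s))               # integer code points
    out <- ifelse(cp <= 127L,
                  intToUtf8(cp, multiple = TRUE),   # keep ASCII as is
                  ifelse(cp <= 0xFFFF,
                         sprintf("\\u%04x", cp),    # e.g. 233 becomes \u00e9
                         fallback))
    paste(out, collapse = "")
  }
  vapply(x, one, character(1), USE.NAMES = FALSE)
}

res <- escape_non_ascii(c("caf\u00e9", "abc"))
writeLines(res)
```

Working on code points avoids branching on the byte length of each character, at the cost of dropping the single-byte \x00 output style.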
-------------- next part --------------
\name{ASCIIfy}
\alias{ASCIIfy}
\title{Convert Characters to ASCII}
\description{
  Convert character vector to ASCII, replacing non-ASCII characters with
  single-byte (\samp{\x00}) or two-byte (\samp{\u0000}) codes.
}
\usage{
ASCIIfy(x, bytes = 2, fallback = "?")
}
\arguments{
  \item{x}{a character vector, possibly containing non-ASCII
    characters.}
  \item{bytes}{either \code{1} or \code{2}, for single-byte
    (\samp{\x00}) or two-byte (\samp{\u0000}) codes.}
  \item{fallback}{an output character to use, when input characters
    cannot be converted.}
}
\value{
  A character vector like \code{x}, except non-ASCII characters have
  been replaced with \samp{\x00} or \samp{\u0000} codes.
}
\author{Arni Magnusson.}
\note{
  To render single backslashes, use these or similar techniques:
  \verb{
    write(ASCIIfy(x), "file.txt")
    cat(paste(ASCIIfy(x), collapse="\n"), "\n", sep="")}

  The resulting strings are plain ASCII and can be used in R functions
  and datasets to improve package portability.
}
\seealso{
  \code{\link[tools]{showNonASCII}} identifies non-ASCII characters in
  a character vector.
}
\examples{
cities <- c("S\u00e3o Paulo", "Reykjav\u00edk")
print(cities)
ASCIIfy(cities, 1)
ASCIIfy(cities, 2)

athens <- "\u0391\u03b8\u03ae\u03bd\u03b1"
print(athens)
ASCIIfy(athens)
}
\keyword{}
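
The note about rendering single backslashes can be illustrated without the function itself. The sketch below assumes the converted string is the one shown in the code comments above ("S\\u00e3o Paulo" as an R literal); `escaped` and `restored` are illustrative names.

```r
# An R literal with an escaped backslash: the string holds 14 characters,
# S \ u 0 0 e 3 o, a space, and Paulo
escaped <- "S\\u00e3o Paulo"

print(escaped)        # print() shows the backslash doubled
writeLines(escaped)   # writeLines() and cat() show the single backslash
nchar(escaped)        # 14

# Pasting the single-backslash form into R source gives back the original
# text, because the parser resolves the \u00e3 escape
restored <- eval(parse(text = paste0('"', escaped, '"')))
identical(enc2utf8(restored), enc2utf8("S\u00e3o Paulo"))
```

This is why the note recommends write() or cat() when the output is meant to be copied into a source file.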
1 day later
#
Nobody else has replied to this, so I will.  It's very unlikely that we 
would incorporate this function into base R.  For one thing, the tools 
package is intended to be tools used by R, not by users.  R doesn't need 
this function, so it doesn't belong in tools.  (Some other functions in 
tools like showNonASCII have come to be used by users, but their primary 
purpose is for R.)

Utility functions that are maintained by R Core and are useful to users 
belong in the utils package.  But I wouldn't add ASCIIfy to that package 
either, because I don't want to impose its maintenance on R Core.

Utility functions that are maintained by others belong in contributed 
packages.  So I'd suggest that you add this function to some package 
that you maintain (perhaps a new one, containing a collection of related 
utility functions), or search CRAN for an appropriate package with a 
maintainer who is willing to take this on.

Duncan Murdoch
On 15/04/2014 1:48 PM, Arni Magnusson wrote:
#
Hi Arni,

I'll be glad to drop ASCIIfy into gtools.  Let me know if this is OK.

-Greg
On Apr 17, 2014, at 9:46 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:

            
#
On 17/04/2014 12:47 PM, Gregory R. Warnes wrote:
Thanks, that sounds like a great solution if Arni doesn't want his own 
package.

Duncan Murdoch
#
Thanks Duncan, for considering ASCIIfy. I understand your reasoning.

This is a recurring pattern - I propose functions for core R, and Greg 
catches them from freefall :)

I'm delighted with ASCIIfy being hosted in gtools. The R code and Rd file 
should be ready as is.

Cheers,

Arni