
Embedded nuls in strings

7 messages · Hervé Pagès, Steven McKinney, Duncan Murdoch

#
Hi,

?rawToChar
     'rawToChar' converts raw bytes either to a single character string
     or a character vector of single bytes.  (Note that a single
     character string could contain embedded nuls.)

Allowing embedded nuls in a string might be an interesting experiment, but it
seems to cause trouble for most of the string manipulation functions.

A string with an embedded 0:

  raw0 <- as.raw(c(65:68, 0 , 70))
  string0 <- rawToChar(raw0)
  string0
[1] "ABCD\0F"

nchar() should return 6:
  nchar(string0)
[1] 4

In addition, this embedded nul seems to break almost all string manipulation/searching
functions:
  grep("F", string0)
  strsplit(string0, split=NULL, fixed=TRUE)[[1]]
  tolower(string0)
  chartr("F", "x", string0)
  substr(string0, 6, 6)
  ...
  etc...

Not very surprisingly, they all seem to treat string0 as if it were "ABCD"!

Cheers,
H.
#
I get similar results on an Apple Mac G5
running OS X, though nchar() works.

[1] 41 42 43 44 00 46
[1] "ABCD\0F"
[1] 6
integer(0)
[1] "A" "B" "C" "D"
[1] "abcd"
[1] "ABCD"
[1] ""

R version 2.5.1 (2007-06-27)
powerpc-apple-darwin8.9.1 

locale:
en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] "splines"   "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"
Steven McKinney

Statistician
Molecular Oncology and Breast Cancer Program
British Columbia Cancer Research Centre

email: smckinney +at+ bccrc +dot+ ca

tel: 604-675-8000 x7561

BCCRC
Molecular Oncology
675 West 10th Ave, Floor 4
Vancouver B.C. 
V5Z 1L3
Canada




#
On 07/08/2007 5:06 PM, Herve Pages wrote:
You don't state your R version.  The default type of counting in nchar() 
has recently changed from "bytes" (where 6 is correct) to "chars" (where 
4 is correct).

Duncan Murdoch
#
Duncan Murdoch wrote:
Oops, sorry:
R version 2.6.0 Under development (unstable) (2007-07-02 r42107)
x86_64-unknown-linux-gnu

locale:
LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] rcompgen_0.1-15


And indeed:
  raw0 <- as.raw(c(65:68, 0 , 70))
  string0 <- rawToChar(raw0)
[1] 4
[1] 6


In addition to the string functions already mentioned, it's worth noting that
'paste' doesn't seem to be "embedded nul aware" either:
[1] "ABCDG"

Same for serialization:
[1] "ABCD"

One comment about the nchar man page:
  'chars' The number of human-readable characters.

"Human-readable" seems to mean "everything but a nul" here, which can be confusing.
For example, one would generally think of ASCII codes 1 to 31 as not "human-readable", but
nchar() seems to disagree:
[1] "\001\002\003\004\005\006\a\b\t\n\v\f\r\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037"
[1] 31


Cheers,
H.
#
On 07/08/2007 6:29 PM, Herve Pages wrote:
Of these, I'd say the serialization is the only case where it would be 
reasonable to fix the behaviour.  R depends on C run-time functions for 
most of the string operations, and they'll stop at a null.  So if this 
isn't documented behaviour, it should be, but it's not reasonable to 
rewrite the C run-time string functions just to handle such weird 
objects.  Functions like "grep" require thousands of lines of code, not 
written by us, and in my opinion maintaining changes to it is not 
something the R project should take on.
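
As a rough sketch of what remains possible without the string functions (an illustration only, with made-up variable names, not something proposed in this thread): the bytes can be kept as a raw vector and inspected there, so that no C string routine ever sees the nul.

```r
raw0 <- as.raw(c(65:68, 0, 70))

# Length in bytes, counting the nul and everything after it.
n_bytes <- length(raw0)                     # 6

# Position of the byte for "F" (0x46), found without any string function.
f_pos <- which(raw0 == charToRaw("F"))      # 6

# One element per byte; the nul byte comes back as an empty string.
bytes <- rawToChar(raw0, multiple = TRUE)   # "A" "B" "C" "D" "" "F"
```

This only covers byte-level work; anything that needs real string semantics still has to go through functions that stop at the nul.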

As to serialization:  there's a comment in the source that embedded 
nulls are handled by it, and that's true up to R-patched, but not in 
R-devel.  Looks like someone has introduced a bug.

Duncan Murdoch
No, "human-readable" also has other meanings in multi-byte encodings. 
If an e-acute is encoded in two bytes in your locale, it still only 
counts as one human-readable character.
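
To make the two counts concrete, here is a small sketch assuming an e-acute stored as the two UTF-8 bytes 0xC3 0xA9. The variable names are only illustrative, and the string is built from raw bytes so the example does not depend on how this message is encoded.

```r
# e-acute as its two UTF-8 bytes, marked as UTF-8 so nchar() knows how to read it.
e_acute <- rawToChar(as.raw(c(0xc3, 0xa9)))
Encoding(e_acute) <- "UTF-8"

n_bytes <- nchar(e_acute, type = "bytes")   # 2: storage size
n_chars <- nchar(e_acute, type = "chars")   # 1: one character
```

Passing 'type' explicitly avoids depending on which default a given R version uses.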
#
Duncan Murdoch wrote:
[...]
I was not (of course) suggesting to fix all the string manipulation functions.
I'm just wondering why R would try to support embedded nuls in the first
place, given that they can only be a source of trouble.

What about this:

  > string0
  [1] "ABCD\0F"
  > string0 == "ABCD"
  [1] TRUE

string0 is obviously different from "ABCD"!

Maybe it's easier to change the semantics of rawToChar() so it doesn't return
a string with embedded nuls. More generally speaking, base functions should
always return "clean" strings.
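
A minimal sketch of what such a "clean" conversion could look like, assuming the desired semantics is to keep only the bytes before the first nul (dropping the nul bytes instead would be another option). The helper name raw_to_clean_char is hypothetical and not part of base R.

```r
# Hypothetical helper (not in base R): convert raw bytes to a string,
# keeping only the bytes that come before the first nul.
raw_to_clean_char <- function(x) {
  stopifnot(is.raw(x))
  nul_pos <- which(x == as.raw(0))
  if (length(nul_pos) > 0) {
    x <- x[seq_len(nul_pos[1] - 1)]
  }
  rawToChar(x)
}

raw0 <- as.raw(c(65:68, 0, 70))
clean <- raw_to_clean_char(raw0)   # "ABCD"
```

With this choice the result matches what most of the string functions already see, so nchar(), "==" and identical() would all agree on it.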
#
On 07/08/2007 9:13 PM, Herve Pages wrote:
I think this predates raw vectors, so this would have been the only way 
to handle strings with embedded nulls.  C has problems with those, but 
not all other languages do.

This is documented behaviour, from ?Comparison:

"When comparisons are made between character strings, parts of the
      strings after embedded 'nul' characters are ignored.  (This is
      necessary as the position of 'nul' in the collation sequence is
      undefined, and we want one of '<', '==' and '>' to be true for any
      comparison.)"

But notice

 > identical(string0, "ABCD")
[1] FALSE

This is documented as

      "Comparison of character strings allows for embedded 'nul'
      characters."

Duncan Murdoch