Skip to content

Suggestion: mkString(NULL) should be NA

13 messages · Jeroen Ooms, Gabriel Becker, William Dunlap +5 more

#
I would like to propose that Rf_mkString(NULL) and Rf_mkChar(NULL)
return NA rather than segfault.

Case: the mkString() and mkChar() functions are convenient to wrap
strings returned by e.g. external C libraries into an R vector.
However sometimes a library returns NULL instead of a string when the
result is unavailable. In some C libraries this can happen
unexpectedly or is even undocumented.

A good R package author always checks results for a null pointer, and
deals with it accordingly. But sometimes we make assumptions. There
was an example in the 'curl' package where a documented version string
was suddenly NULL if libcurl was built with some unusual
configuration. These problems are hard to catch and I don't see any
benefit of segfaulting for such edge cases.

Some packages use a macro like this to protect against such problems:

  #define make_string(x) x ? Rf_mkString(x) : ScalarString(NA_STRING)

But I think it would make sense if this was the default behavior in
Rf_mkString and Rf_mkChar.
12 days later
#
On Thu, May 12, 2016 at 1:20 PM, Jeroen Ooms <jeroen.ooms at stat.ucla.edu> wrote:
An example implementation: https://git.io/vroxm

With this patch, mkChar(NULL), mkCharCE(NULL, ce) would return
NA_STRING rather than segfault at strlen(NULL). This automatically
fixes mkString(NULL) as well which wraps mkChar (See Rinlinedfuns.h).
#
Shouldn't Rf_mkString(NULL) return (the c-level equivalent of) character()
rather than the NA_character_? An empty string and NULL aren't the same. It
seems reasonable for Rf_mkChar to give NA_character_ though.

~G

On Tue, May 24, 2016 at 8:42 AM, Jeroen Ooms <jeroen.ooms at stat.ucla.edu>
wrote:

  
    
#
Why should Rf_mkString(NULL) produce NA_STRING instead of ""
(R_BlankString)?  I prefer that passing in a nil pointer would cause
an error instead, as the nil may arise by accident, perhaps a pointer
to freed memory, and I would like to be notified that my code is bad instead
of getting a random NA_STRING.


Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Tue, May 24, 2016 at 8:42 AM, Jeroen Ooms <jeroen.ooms at stat.ucla.edu>
wrote:

  
  
#
On Tue, May 24, 2016 at 5:59 PM, Gabriel Becker <gmbecker at ucdavis.edu> wrote:
No. It should still be safe to assume that mkString() always returns a
character vector of exactly length one. Anything else could lead to
type errors.
Exactly! So if you pass in an empty C string, you get an empty R
string, and if you pass in a null pointer you get NA.

Rf_mkString(NULL)   <--> NA
Rf_mkString("")   <--> ""

There is no ambiguity, and much better than segfaulting.
#
On Tue, May 24, 2016 at 9:30 AM, Jeroen Ooms <jeroen.ooms at stat.ucla.edu>
wrote:
Well the thing is you're passing an invalid pointer, that doesn't point to
a C string, to a constructor expecting a valid const char *. I'm fine with
the contract being that mkString always returns a character vector of
length one, but that doesn't necessarily mean that the function needs to
accept NULL pointers. The contract as I understand it is that if you give
it a C string, it will create a CHARSXP for that string. In this light,
Bill's suggestion that it throw an error seems the most principled
response. I would think you would need to at the very least emit a warning.
Well, better than segfaulting is not really relevant here. No one is
arguing that it should segfault. The question is what behavior it should
have when it doesn't segfault.

It's true that a C empty string is not the same as NULL, but NULL isn't the
same as NA either. Semantically, for your use-case (which I gather arose
from interactions we had :) ) the NULL means there is no version, while NA
indicates there is a version but we don't know what it is. Imagine an
object class that represents a persons name (first, middle, last). Now take
two people, One has no middle name (and we know that when creating the
object) and another for whom we don't have any information about the middle
name, only first and last were reported. I would expect the first one to
have middle name either NULL or (in a data.frame context) "", while the
second would have NA_character_. In this light, mkString should arguably
generate "". i don't think the fact that there is another way to get "" is
a particularly large problem.

On the other hand, and in support of your position it came up as Michael
Lawrence and I were talking about this that asChar from utils.c  will give
you NA_STRING when you give it R_NilValue. That is a coercion though,
whereas arguably mkString is not. That said, consistency would probably be
good.

~G
#
> On Tue, May 24, 2016 at 9:30 AM, Jeroen Ooms <jeroen.ooms at stat.ucla.edu>
> wrote:
>> On Tue, May 24, 2016 at 5:59 PM, Gabriel Becker <gmbecker at ucdavis.edu>
>> wrote:
>> > Shouldn't Rf_mkString(NULL) return (the c-level equivalent of)
    >> character()
    >> > rather than the NA_character_?
    >> 
    >> No. It should still be safe to assume that mkString() always returns a
    >> character vector of exactly length one. Anything else could lead to
    >> type errors.
    >> 


    > Well the thing is you're passing an invalid pointer, that doesn't point to
    > a C string, to a constructor expecting a valid const char *. I'm fine with
    > the contract being that mkString always returns a character vector of
    > length one, but that doesn't necessarily mean that the function needs to
    > accept NULL pointers. The contract as I understand it is that if you give
    > it a C string, it will create a CHARSXP for that string. In this light,
    > Bill's suggestion that it throw an error seems the most principled
    > response. I would think you would need to at the very least emit a warning.

I agree with Jerooen that mkChar() and mkString() may be
used in contexts where they can end up with a NULL and hence
should not segfault... and hence am willing the extra (very
small) penalty of checking for NULL.

    >> 
    >> > An empty string and NULL aren't the same.
    >> 
    >> Exactly! So if you pass in an empty C string, you get an empty R
    >> string, and if you pass in a null pointer you get NA.
    >> 
    >> Rf_mkString(NULL)   <--> NA
    >> Rf_mkString("")   <--> ""
    >> 
    >> There is no ambiguity, and much better than segfaulting.

Better than segfaulting, yes, but really agree with Bill (and
Gabe), also for Rf_mkChar(NULL): 
I think both functions should give an error in such a case
rather than returning NA_character_

It is an accident of some kind if they got NULL, no?

--
Martin Maechler,
ETH Zurich



    > Well, better than segfaulting is not really relevant here. No one is
    > arguing that it should segfault. The question is what behavior it should
    > have when it doesn't segfault.

    > It's true that a C empty string is not the same as NULL, but NULL isn't the
    > same as NA either. Semantically, for your use-case (which I gather arose
    > from interactions we had :) ) the NULL means there is no version, while NA
    > indicates there is a version but we don't know what it is. Imagine an
    > object class that represents a persons name (first, middle, last). Now take
    > two people, One has no middle name (and we know that when creating the
    > object) and another for whom we don't have any information about the middle
    > name, only first and last were reported. I would expect the first one to
    > have middle name either NULL or (in a data.frame context) "", while the
    > second would have NA_character_. In this light, mkString should arguably
    > generate "". i don't think the fact that there is another way to get "" is
    > a particularly large problem.

    > On the other hand, and in support of your position it came up as Michael
    > Lawrence and I were talking about this that asChar from utils.c  will give
    > you NA_STRING when you give it R_NilValue. That is a coercion though,
    > whereas arguably mkString is not. That said, consistency would probably be
    > good.

    > ~G


    > -- 
    > Gabriel Becker, PhD
    > Associate Scientist (Bioinformatics)
    > Genentech Research

    > [[alternative HTML version deleted]]

    > ______________________________________________
    > R-devel at r-project.org mailing list
    > https://stat.ethz.ch/mailman/listinfo/r-devel
#
On Wed, May 25, 2016 at 12:31 PM, Martin Maechler
<maechler at stat.math.ethz.ch> wrote:
Not necessarily. A char* of NULL can be a string which is not
initiated or simply unavailable due to configuration.

The example from my original email was in curl package which exposes
the version string of libz that was used to build libcurl:

  mkString(data->libz_version)

This worked on all platforms that I tested. However a user found that
if libcurl was configured --without-libz (which is uncommon) the
libz_version string does not get set by libcurl and is always NULL. I
had not foreseen this and it would lead to a segfault.

I think making mkString() return NA for null strings lead to the most
robust behavior. Raising an exception seems a little harsh to me, as
there is no way the user would be able to recover from this, and there
might not be an actual problem at all.
#
On Wed, May 25, 2016 at 4:23 AM, Jeroen Ooms <jeroen.ooms at stat.ucla.edu>
wrote:
Robust in the sense of no error being thrown, but perhaps only correct by
accident. NULL is not a valid C string --- should functions always return
NA on invalid input? As Gabe mentions, in the cited use case, it's not
clear whether the appropriate value is NA, "", or something else entirely.
Generalization seems risky at this point.

  
  
#
On Wed, May 25, 2016 at 7:22 AM, Michael Lawrence <lawrence.michael at gene.com
I'm not disagreeing with what's been said in this thread, but I can't help
but recall that I brought up this exact issue probably 15 years ago and was
told (by Brian, I believe) "don't do that" (pass a null pointer), which was
perfectly fine. The real issue was not the behavior but that it was not
documented or consistent. I've lived by the mantra since that you can never
trust a pointer in R code. User must always check for NULL.

I just wrote my own functions mkXXX_safe that wrap the internals and check
the pointer.

THK

http://www.keittlab.org/
#
On Wed, 25 May 2016, Tim Keitt wrote:

            
In _C_ code.  This is true whether you are calling into the R C API or
any other C library: you as the C programmer need to make sure either
that passing NULL is OK or make sure you don't do that.

I wouldn't object to mkXXX checking for NULL and signaling an error
instead of segfaulting, but good C code calling mkXXX should still
typically do its own check and handle the situation in an appropriate
way.

Best,

luke

  
    
#
http://www.keittlab.org/
On Wed, May 25, 2016 at 10:43 AM, <luke-tierney at uiowa.edu> wrote:

            
I agree -- I meant it was "perfectly fine" to remind us we need to check
pointers. Its really a documentation issue.

THK

  
  
#
Hi,

I tend to agree with the objections expressed earlier. I would only
add that making the NULL pointer semantically equivalent to NA would
introduce a precedent that could lead to some confusion. For example
it would set the expectation that CHAR(Rf_mkChar(NULL)) is NULL,
which is not the case AFAIK. Or that low-level string manipulation
utilities that take a C-string as input (e.g. Rf_reEnc()) accept NULL
and propagate it.

Of course these things can be modified to be consistent with the new
"NULL pointer == NA" paradigm but this might end up being a bigger move
than what it seems at first sight...

Cheers,
H.
On 05/25/2016 08:43 AM, luke-tierney at uiowa.edu wrote: