I would like to propose that Rf_mkString(NULL) and Rf_mkChar(NULL) return NA rather than segfault. Case: the mkString() and mkChar() functions are convenient to wrap strings returned by e.g. external C libraries into an R vector. However sometimes a library returns NULL instead of a string when the result is unavailable. In some C libraries this can happen unexpectedly or is even undocumented. A good R package author always checks results for a null pointer, and deals with it accordingly. But sometimes we make assumptions. There was an example in the 'curl' package where a documented version string was suddenly NULL if libcurl was built with some unusual configuration. These problems are hard to catch and I don't see any benefit of segfaulting for such edge cases. Some packages use a macro like this to protect against such problems: #define make_string(x) x ? Rf_mkString(x) : ScalarString(NA_STRING) But I think it would make sense if this was the default behavior in Rf_mkString and Rf_mkChar.
Suggestion: mkString(NULL) should be NA
13 messages · Jeroen Ooms, Gabriel Becker, William Dunlap +5 more
12 days later
On Thu, May 12, 2016 at 1:20 PM, Jeroen Ooms <jeroen.ooms at stat.ucla.edu> wrote:
I would like to propose that Rf_mkString(NULL) and Rf_mkChar(NULL) return NA rather than segfault.
An example implementation: https://git.io/vroxm With this patch, mkChar(NULL), mkCharCE(NULL, ce) would return NA_STRING rather than segfault at strlen(NULL). This automatically fixes mkString(NULL) as well which wraps mkChar (See Rinlinedfuns.h).
Shouldn't Rf_mkString(NULL) return (the c-level equivalent of) character() rather than the NA_character_? An empty string and NULL aren't the same. It seems reasonable for Rf_mkChar to give NA_character_ though. ~G On Tue, May 24, 2016 at 8:42 AM, Jeroen Ooms <jeroen.ooms at stat.ucla.edu> wrote:
On Thu, May 12, 2016 at 1:20 PM, Jeroen Ooms <jeroen.ooms at stat.ucla.edu> wrote:
I would like to propose that Rf_mkString(NULL) and Rf_mkChar(NULL) return NA rather than segfault.
An example implementation: https://git.io/vroxm With this patch, mkChar(NULL), mkCharCE(NULL, ce) would return NA_STRING rather than segfault at strlen(NULL). This automatically fixes mkString(NULL) as well which wraps mkChar (See Rinlinedfuns.h).
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Gabriel Becker, PhD Associate Scientist (Bioinformatics) Genentech Research [[alternative HTML version deleted]]
Why should Rf_mkString(NULL) produce NA_STRING instead of "" (R_BlankString)? I prefer that passing in a nil pointer would cause an error instead, as the nil may arise by accident, perhaps a pointer to freed memory, and I would like to be notified that my code is bad instead of getting a random NA_STRING. Bill Dunlap TIBCO Software wdunlap tibco.com On Tue, May 24, 2016 at 8:42 AM, Jeroen Ooms <jeroen.ooms at stat.ucla.edu> wrote:
On Thu, May 12, 2016 at 1:20 PM, Jeroen Ooms <jeroen.ooms at stat.ucla.edu> wrote:
I would like to propose that Rf_mkString(NULL) and Rf_mkChar(NULL) return NA rather than segfault.
An example implementation: https://git.io/vroxm With this patch, mkChar(NULL), mkCharCE(NULL, ce) would return NA_STRING rather than segfault at strlen(NULL). This automatically fixes mkString(NULL) as well which wraps mkChar (See Rinlinedfuns.h).
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
On Tue, May 24, 2016 at 5:59 PM, Gabriel Becker <gmbecker at ucdavis.edu> wrote:
Shouldn't Rf_mkString(NULL) return (the c-level equivalent of) character() rather than the NA_character_?
No. It should still be safe to assume that mkString() always returns a character vector of exactly length one. Anything else could lead to type errors.
An empty string and NULL aren't the same.
Exactly! So if you pass in an empty C string, you get an empty R
string, and if you pass in a null pointer you get NA.
Rf_mkString(NULL) <--> NA
Rf_mkString("") <--> ""
There is no ambiguity, and much better than segfaulting.
On Tue, May 24, 2016 at 9:30 AM, Jeroen Ooms <jeroen.ooms at stat.ucla.edu> wrote:
On Tue, May 24, 2016 at 5:59 PM, Gabriel Becker <gmbecker at ucdavis.edu> wrote:
Shouldn't Rf_mkString(NULL) return (the c-level equivalent of)
character()
rather than the NA_character_?
No. It should still be safe to assume that mkString() always returns a character vector of exactly length one. Anything else could lead to type errors.
Well the thing is you're passing an invalid pointer, that doesn't point to a C string, to a constructor expecting a valid const char *. I'm fine with the contract being that mkString always returns a character vector of length one, but that doesn't necessarily mean that the function needs to accept NULL pointers. The contract as I understand it is that if you give it a C string, it will create a CHARSXP for that string. In this light, Bill's suggestion that it throw an error seems the most principled response. I would think you would need to at the very least emit a warning.
An empty string and NULL aren't the same.
Exactly! So if you pass in an empty C string, you get an empty R
string, and if you pass in a null pointer you get NA.
Rf_mkString(NULL) <--> NA
Rf_mkString("") <--> ""
There is no ambiguity, and much better than segfaulting.
Well, better than segfaulting is not really relevant here. No one is arguing that it should segfault. The question is what behavior it should have when it doesn't segfault. It's true that a C empty string is not the same as NULL, but NULL isn't the same as NA either. Semantically, for your use-case (which I gather arose from interactions we had :) ) the NULL means there is no version, while NA indicates there is a version but we don't know what it is. Imagine an object class that represents a persons name (first, middle, last). Now take two people, One has no middle name (and we know that when creating the object) and another for whom we don't have any information about the middle name, only first and last were reported. I would expect the first one to have middle name either NULL or (in a data.frame context) "", while the second would have NA_character_. In this light, mkString should arguably generate "". i don't think the fact that there is another way to get "" is a particularly large problem. On the other hand, and in support of your position it came up as Michael Lawrence and I were talking about this that asChar from utils.c will give you NA_STRING when you give it R_NilValue. That is a coercion though, whereas arguably mkString is not. That said, consistency would probably be good. ~G
Gabriel Becker, PhD Associate Scientist (Bioinformatics) Genentech Research [[alternative HTML version deleted]]
Gabriel Becker <gmbecker at ucdavis.edu>
on Tue, 24 May 2016 10:30:48 -0700 writes:
> On Tue, May 24, 2016 at 9:30 AM, Jeroen Ooms <jeroen.ooms at stat.ucla.edu>
> wrote:
>> On Tue, May 24, 2016 at 5:59 PM, Gabriel Becker <gmbecker at ucdavis.edu>
>> wrote:
>> > Shouldn't Rf_mkString(NULL) return (the c-level equivalent of)
>> character()
>> > rather than the NA_character_?
>>
>> No. It should still be safe to assume that mkString() always returns a
>> character vector of exactly length one. Anything else could lead to
>> type errors.
>>
> Well the thing is you're passing an invalid pointer, that doesn't point to
> a C string, to a constructor expecting a valid const char *. I'm fine with
> the contract being that mkString always returns a character vector of
> length one, but that doesn't necessarily mean that the function needs to
> accept NULL pointers. The contract as I understand it is that if you give
> it a C string, it will create a CHARSXP for that string. In this light,
> Bill's suggestion that it throw an error seems the most principled
> response. I would think you would need to at the very least emit a warning.
I agree with Jerooen that mkChar() and mkString() may be
used in contexts where they can end up with a NULL and hence
should not segfault... and hence am willing the extra (very
small) penalty of checking for NULL.
>>
>> > An empty string and NULL aren't the same.
>>
>> Exactly! So if you pass in an empty C string, you get an empty R
>> string, and if you pass in a null pointer you get NA.
>>
>> Rf_mkString(NULL) <--> NA
>> Rf_mkString("") <--> ""
>>
>> There is no ambiguity, and much better than segfaulting.
Better than segfaulting, yes, but really agree with Bill (and
Gabe), also for Rf_mkChar(NULL):
I think both functions should give an error in such a case
rather than returning NA_character_
It is an accident of some kind if they got NULL, no?
--
Martin Maechler,
ETH Zurich
> Well, better than segfaulting is not really relevant here. No one is
> arguing that it should segfault. The question is what behavior it should
> have when it doesn't segfault.
> It's true that a C empty string is not the same as NULL, but NULL isn't the
> same as NA either. Semantically, for your use-case (which I gather arose
> from interactions we had :) ) the NULL means there is no version, while NA
> indicates there is a version but we don't know what it is. Imagine an
> object class that represents a persons name (first, middle, last). Now take
> two people, One has no middle name (and we know that when creating the
> object) and another for whom we don't have any information about the middle
> name, only first and last were reported. I would expect the first one to
> have middle name either NULL or (in a data.frame context) "", while the
> second would have NA_character_. In this light, mkString should arguably
> generate "". i don't think the fact that there is another way to get "" is
> a particularly large problem.
> On the other hand, and in support of your position it came up as Michael
> Lawrence and I were talking about this that asChar from utils.c will give
> you NA_STRING when you give it R_NilValue. That is a coercion though,
> whereas arguably mkString is not. That said, consistency would probably be
> good.
> ~G
> --
> Gabriel Becker, PhD
> Associate Scientist (Bioinformatics)
> Genentech Research
> [[alternative HTML version deleted]]
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
On Wed, May 25, 2016 at 12:31 PM, Martin Maechler
<maechler at stat.math.ethz.ch> wrote:
Better than segfaulting, yes, but really agree with Bill (and Gabe), also for Rf_mkChar(NULL): I think both functions should give an error in such a case rather than returning NA_character_ It is an accident of some kind if they got NULL, no?
Not necessarily. A char* of NULL can be a string which is not initiated or simply unavailable due to configuration. The example from my original email was in curl package which exposes the version string of libz that was used to build libcurl: mkString(data->libz_version) This worked on all platforms that I tested. However a user found that if libcurl was configured --without-libz (which is uncommon) the libz_version string does not get set by libcurl and is always NULL. I had not foreseen this and it would lead to a segfault. I think making mkString() return NA for null strings lead to the most robust behavior. Raising an exception seems a little harsh to me, as there is no way the user would be able to recover from this, and there might not be an actual problem at all.
On Wed, May 25, 2016 at 4:23 AM, Jeroen Ooms <jeroen.ooms at stat.ucla.edu> wrote:
On Wed, May 25, 2016 at 12:31 PM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
Better than segfaulting, yes, but really agree with Bill (and Gabe), also for Rf_mkChar(NULL): I think both functions should give an error in such a case rather than returning NA_character_ It is an accident of some kind if they got NULL, no?
Not necessarily. A char* of NULL can be a string which is not initiated or simply unavailable due to configuration. The example from my original email was in curl package which exposes the version string of libz that was used to build libcurl: mkString(data->libz_version) This worked on all platforms that I tested. However a user found that if libcurl was configured --without-libz (which is uncommon) the libz_version string does not get set by libcurl and is always NULL. I had not foreseen this and it would lead to a segfault. I think making mkString() return NA for null strings lead to the most robust behavior. Raising an exception seems a little harsh to me, as there is no way the user would be able to recover from this, and there might not be an actual problem at all.
Robust in the sense of no error being thrown, but perhaps only correct by accident. NULL is not a valid C string --- should functions always return NA on invalid input? As Gabe mentions, in the cited use case, it's not clear whether the appropriate value is NA, "", or something else entirely. Generalization seems risky at this point.
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
On Wed, May 25, 2016 at 7:22 AM, Michael Lawrence <lawrence.michael at gene.com
wrote:
On Wed, May 25, 2016 at 4:23 AM, Jeroen Ooms <jeroen.ooms at stat.ucla.edu> wrote:
I'm not disagreeing with what's been said in this thread, but I can't help but recall that I brought up this exact issue probably 15 years ago and was told (by Brian, I believe) "don't do that" (pass a null pointer), which was perfectly fine. The real issue was not the behavior but that it was not documented or consistent. I've lived by the mantra since that you can never trust a pointer in R code. User must always check for NULL. I just wrote my own functions mkXXX_safe that wrap the internals and check the pointer. THK http://www.keittlab.org/
On Wed, 25 May 2016, Tim Keitt wrote:
On Wed, May 25, 2016 at 7:22 AM, Michael Lawrence <lawrence.michael at gene.com
wrote:
On Wed, May 25, 2016 at 4:23 AM, Jeroen Ooms <jeroen.ooms at stat.ucla.edu> wrote:
I'm not disagreeing with what's been said in this thread, but I can't help but recall that I brought up this exact issue probably 15 years ago and was told (by Brian, I believe) "don't do that" (pass a null pointer), which was perfectly fine. The real issue was not the behavior but that it was not documented or consistent. I've lived by the mantra since that you can never trust a pointer in R code. User must always check for NULL.
In _C_ code. This is true whether you are calling into the R C API or any other C library: you as the C programmer need to make sure either that passing NULL is OK or make sure you don't do that. I wouldn't object to mkXXX checking for NULL and signaling an error instead of segfaulting, but good C code calling mkXXX should still typically do its own check and handle the situation in an appropriate way. Best, luke
I just wrote my own functions mkXXX_safe that wrap the internals and check the pointer. THK http://www.keittlab.org/ [[alternative HTML version deleted]]
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa Phone: 319-335-3386
Department of Statistics and Fax: 319-335-3017
Actuarial Science
241 Schaeffer Hall email: luke-tierney at uiowa.edu
Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu
http://www.keittlab.org/
On Wed, May 25, 2016 at 10:43 AM, <luke-tierney at uiowa.edu> wrote:
On Wed, 25 May 2016, Tim Keitt wrote: On Wed, May 25, 2016 at 7:22 AM, Michael Lawrence <
lawrence.michael at gene.com
wrote:
On Wed, May 25, 2016 at 4:23 AM, Jeroen Ooms <jeroen.ooms at stat.ucla.edu>
wrote:
I'm not disagreeing with what's been said in this thread, but I can't help but recall that I brought up this exact issue probably 15 years ago and was told (by Brian, I believe) "don't do that" (pass a null pointer), which was perfectly fine. The real issue was not the behavior but that it was not documented or consistent. I've lived by the mantra since that you can never trust a pointer in R code. User must always check for NULL.
In _C_ code. This is true whether you are calling into the R C API or any other C library: you as the C programmer need to make sure either that passing NULL is OK or make sure you don't do that.
I agree -- I meant it was "perfectly fine" to remind us we need to check pointers. Its really a documentation issue. THK
I wouldn't object to mkXXX checking for NULL and signaling an error instead of segfaulting, but good C code calling mkXXX should still typically do its own check and handle the situation in an appropriate way. Best, luke
I just wrote my own functions mkXXX_safe that wrap the internals and check the pointer. THK http://www.keittlab.org/ [[alternative HTML version deleted]]
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
-- Luke Tierney Ralph E. Wareham Professor of Mathematical Sciences University of Iowa Phone: 319-335-3386 Department of Statistics and Fax: 319-335-3017 Actuarial Science 241 Schaeffer Hall email: luke-tierney at uiowa.edu Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu
Hi, I tend to agree with the objections expressed earlier. I would only add that making the NULL pointer semantically equivalent to NA would introduce a precedent that could lead to some confusion. For example it would set the expectation that CHAR(Rf_mkChar(NULL)) is NULL, which is not the case AFAIK. Or that low-level string manipulation utilities that take a C-string as input (e.g. Rf_reEnc()) accept NULL and propagate it. Of course these things can be modified to be consistent with the new "NULL pointer == NA" paradigm but this might end up being a bigger move than what it seems at first sight... Cheers, H.
On 05/25/2016 08:43 AM, luke-tierney at uiowa.edu wrote:
On Wed, 25 May 2016, Tim Keitt wrote:
On Wed, May 25, 2016 at 7:22 AM, Michael Lawrence <lawrence.michael at gene.com
wrote:
On Wed, May 25, 2016 at 4:23 AM, Jeroen Ooms <jeroen.ooms at stat.ucla.edu> wrote:
I'm not disagreeing with what's been said in this thread, but I can't help but recall that I brought up this exact issue probably 15 years ago and was told (by Brian, I believe) "don't do that" (pass a null pointer), which was perfectly fine. The real issue was not the behavior but that it was not documented or consistent. I've lived by the mantra since that you can never trust a pointer in R code. User must always check for NULL.
In _C_ code. This is true whether you are calling into the R C API or any other C library: you as the C programmer need to make sure either that passing NULL is OK or make sure you don't do that. I wouldn't object to mkXXX checking for NULL and signaling an error instead of segfaulting, but good C code calling mkXXX should still typically do its own check and handle the situation in an appropriate way. Best, luke
I just wrote my own functions mkXXX_safe that wrap the internals and check the pointer. THK http://www.keittlab.org/ [[alternative HTML version deleted]]
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fredhutch.org Phone: (206) 667-5791 Fax: (206) 667-1319