First, let me apologize in advance if this is the wrong place to submit
a suggestion for a change to functions in the base-R package. It never
really occurred to me that I'd have an idea worthy of such a change.
My idea is to provide an upgrade to all the "sets" tools (intersect,
union, setdiff, setequal) that allows the user to apply them in a
strictly algebraic style.
The current tools, as is well documented, remove duplicate values in the
input vectors. This can be helpful in stats work, but is inconsistent
with the mathematical concept of sets and set measure. What I propose
is that all these functions be given an additional argument with a
default value, multiple=FALSE. When called this way, the functions
remain as at present. When called with multiple=TRUE, they treat the
input vectors as true 'sets' of elements.
I've already written and tested upgrades to all four functions, so if
upgrading the base-R package is not appropriate, I'll post as a package
to CRAN. It just seems more sensible to add to the base.
Thanks in advance for any advice or comments.
(Please be sure to email, as I can't recall if I'm currently registered
for r-devel)
Here's an example of the new code:
intersect <- function(x, y, multiple = FALSE) {
    y <- as.vector(y)
    trueint <- y[match(as.vector(x), y, 0L)]
    if (!multiple) trueint <- unique(trueint)
    return(trueint)
}
thanks
Carl
suggestion for "sets" tools upgrade
5 messages: Carl Witthoft, R. Michael Weylandt, Kevin Coombes, Duncan Murdoch
On Thu, Feb 6, 2014 at 8:31 PM, Carl Witthoft <carl at witthoft.com> wrote:
[...] The current tools, as well documented, remove duplicate values in the input vectors. This can be helpful in stats work, but is inconsistent with the mathematical concept of sets and set measure.
No comments about back-compatibility concerns, etc., but why do you
think this is closer to the "mathematical concept of sets"? As I
learned them, sets have no repeats (or order), and other languages with
set primitives tend to agree:
python> {1,1,2,3} == {1,2,3}
True
I believe C++ calls what you're looking for a multiset (albeit with a
guarantee of orderedness).
Cheers,
Michael
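The multiset Michael mentions can be made concrete with Python's standard-library collections.Counter, which behaves as a multiset: & keeps the per-element minimum of counts, | the maximum. (This sketch is an editorial illustration using the same values that come up later in the thread.)

```python
from collections import Counter

# Counter acts as a multiset: & takes the per-element minimum count,
# | takes the per-element maximum count.
a = Counter([1, 1, 2, 3])
b = Counter([1, 1, 1, 4])

print(sorted((a & b).elements()))  # [1, 1]              (min of counts)
print(sorted((a | b).elements()))  # [1, 1, 1, 2, 3, 4]  (max of counts)
```

Because the result is computed from counts, both operations are symmetric in their arguments by construction.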
As a mathematician by training (and a former practicing mathematician,
both of which qualifications I rarely feel compelled to pull out of the
closet), I have to agree with Michael's challenge to the original
assertion about the "mathematical concept of sets".
Sets are collections of distinct objects (at least in Cantor's original
naive definition) and do not have a notion of "duplicate values". In
the modern axiomatic definition, one axiom is that "two sets are equal
if and only if they contain the same members". To expand on Michael's
example, the union of {1, 2} with {1, 3} is {1, 2, 3}, not {1, 2, 1, 3},
since there is only one distinct object designated by the value "1".
A computer programming language could choose to use the ordered vector
(or list) [1, 2, 1, 3] as an internal representation of the union of
[1, 2] and [1, 3], but it would then have to work hard to perform every
other meaningful set operation. For instance, the cardinality of the
union still has to equal three (not four, which is the length of the
list), since there are exactly three distinct objects that are members.
And, as Michael points out, the set represented by [1,2,3] has to be
equal to the set represented by [1,2,1,3] since they contain exactly the
same members.
Kevin
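Kevin's point about the extra work can be sketched in a few lines of Python: a program that keeps the list [1, 2, 1, 3] as its representation of a set must still deduplicate before answering the basic set-theoretic questions.

```python
# A list used as a set representation must be deduplicated before
# answering basic set-theoretic questions.
rep_union = [1, 2, 1, 3]           # naive "union" of [1, 2] and [1, 3]

cardinality = len(set(rep_union))  # 3, not len(rep_union) == 4
equal = set(rep_union) == {1, 2, 3}  # True: exactly the same members

print(cardinality, equal)  # 3 True
```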
On 14-02-06 8:31 PM, Carl Witthoft wrote:
[...] The current tools, as well documented, remove duplicate values in the input vectors. This can be helpful in stats work, but is inconsistent with the mathematical concept of sets and set measure.
I understand what you are asking for, but I think this justification
for it is just wrong. Sets don't have duplicated elements: an element
is in a set, or it is not. It can't be in the set more than once.
Here's an example of the new code:
intersect <- function(x, y, multiple = FALSE) {
    y <- as.vector(y)
    trueint <- y[match(as.vector(x), y, 0L)]
    if (!multiple) trueint <- unique(trueint)
    return(trueint)
}
This is not symmetric. I'd like intersect(x,y,TRUE) to be the same as
intersect(y,x,TRUE), up to re-ordering. That's not true of your function:

> x <- c(1,1,2,3)
> y <- c(1,1,1,4)
> intersect(x,y,multiple=TRUE)
[1] 1 1
> intersect(y,x,multiple=TRUE)
[1] 1 1 1

I'd suggest that you clearly define what you mean by your functions,
and put them in a package, along with examples where they give more
useful results than the standard definitions. I think the current base
package functions match the mathematical definitions better.

Duncan Murdoch
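The symmetry Duncan asks for is exactly what a count-based (multiset) intersection delivers: each value appears min(count in x, count in y) times. A minimal Python sketch (the function name msintersect is invented here purely for illustration):

```python
from collections import Counter

def msintersect(x, y):
    """Multiset intersection: each value appears min(count in x, count in y)
    times, so the result is symmetric up to ordering."""
    return sorted((Counter(x) & Counter(y)).elements())

x = [1, 1, 2, 3]
y = [1, 1, 1, 4]
print(msintersect(x, y))  # [1, 1]
print(msintersect(y, x))  # [1, 1] -- same result either way
```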
Thanks to Duncan and all who responded. I agree that the algebraic set
rules do not allow for indistinguishable elements; I must have been
deeply immersed in quantum fermions when I wrote "strictly" rather than
"less" in front of "algebraic style". I'll clean up my code (so that
intersect() remains symmetric, among other things) and submit as a
separate package to CRAN.

Carl
Sent from a parallel universe almost, but not entirely, nothing at all like this one.