I wonder if we might also include an "equiv" function along with the
other set functions (ie "union", "intersect", etc), perhaps along the
lines of

"equiv" <- function(x, y) all(c(match(x, y, 0)>0, match(y, x, 0)>0))

(which I think might be the quickest implementation). I use this type of
function quite frequently: is there some reason why it is not in the base?

Cheers, Jonathan.

Jonathan Rougier Science Laboratories
Department of Mathematical Sciences South Road
University of Durham Durham DH1 3LE
"[B]egin upon the precept ... that the things we see are to be
weighed in the scale with what we know" (Meredith, 1879, The Egoist)
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
set functions
11 messages · Peter Dalgaard, Jonathan Rougier, Martin Maechler +1 more
Jonathan Rougier <J.C.Rougier@durham.ac.uk> writes:
> I wonder if we might also include an "equiv" function along with the
> other set functions (ie "union", "intersect", etc), perhaps along the
> lines of
>
> "equiv" <- function(x, y) all(c(match(x, y, 0)>0, match(y, x, 0)>0))
>
> (which I think might be the quickest implementation). I use this type of
> function quite frequently: is there some reason why it is not in the base?

length(setdiff(x,y))==0 appears to be about twice as fast....

   O__  ---- Peter Dalgaard             Blegdamsvej 3
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk)             FAX: (+45) 35327907
On 4 Jan 2000, Peter Dalgaard BSA wrote:

> > "equiv" <- function(x, y) all(c(match(x, y, 0)>0, match(y, x, 0)>0))
>
> length(setdiff(x,y))==0 appears to be about twice as fast....

But I don't think that would be right!

length(setdiff(1:4, 1:5))==0  # is TRUE
equiv(1:4, 1:5)               # clearly FALSE

Jonathan.

Jonathan Rougier Science Laboratories
Department of Mathematical Sciences South Road
University of Durham Durham DH1 3LE
"[B]egin upon the precept ... that the things we see are to be
weighed in the scale with what we know" (Meredith, 1879, The Egoist)
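The objection above can be checked directly. What follows is a minimal sketch, not code from the thread: the helper names `is_subset` and `equiv_setdiff` are hypothetical and are introduced only to show that an empty `setdiff(x, y)` establishes containment in one direction, so set equality needs the test both ways.

```r
# Hypothetical helpers for illustration; not part of the thread or of base R.

# TRUE when every element of x also occurs in y (x is a subset of y).
is_subset <- function(x, y) length(setdiff(x, y)) == 0

# Set equality needs the subset test in both directions.
equiv_setdiff <- function(x, y) is_subset(x, y) && is_subset(y, x)

is_subset(1:4, 1:5)                # one direction only: 1:4 lies inside 1:5
equiv_setdiff(1:4, 1:5)            # 5 is missing from 1:4, so the sets differ
equiv_setdiff(c(3, 1, 2, 2), 1:3)  # order and duplicates are ignored
```

The scalar `&&` is used rather than `&` because each side is a single logical value, and it lets the second `setdiff` be skipped when the first test already fails.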
Jonathan Rougier <J.C.Rougier@durham.ac.uk> writes:
> On 4 Jan 2000, Peter Dalgaard BSA wrote:
>
> > > "equiv" <- function(x, y) all(c(match(x, y, 0)>0, match(y, x, 0)>0))
> >
> > length(setdiff(x,y))==0 appears to be about twice as fast....
>
> But I don't think that would be right!
>
> length(setdiff(1:4, 1:5))==0  # is TRUE
> equiv(1:4, 1:5)               # clearly FALSE

Argh. I was thinking of the symmetric set difference. So you'd need
length(setdiff(y,x))==0 && length(setdiff(x,y))==0, which is obviously
only half as fast as twice as fast....

However:

equiv<-function(x,y)
  length(x<-unique(x))==length(y<-unique(y)) &&
  all(sort(x)==sort(y))

   O__  ---- Peter Dalgaard             Blegdamsvej 3
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk)             FAX: (+45) 35327907
On 4 Jan 2000, Peter Dalgaard BSA wrote:
> > length(setdiff(1:4, 1:5))==0  # is TRUE
> > equiv(1:4, 1:5)               # clearly FALSE
>
> Argh. I was thinking of the symmetric set difference. So you'd need
> length(setdiff(y,x))==0 && length(setdiff(x,y))==0, which is obviously
> only half as fast as twice as fast....
>
> However:
>
> equiv<-function(x,y)
>   length(x<-unique(x))==length(y<-unique(y)) &&
>   all(sort(x)==sort(y))

Yes, I wondered about that, and also about
"equiv" <-
function(x, y) {
x <- unique(x)
y <- unique(y)
length(x)==length(y) && all(1:length(y) == sort(match(x, y, 0)))
}
but I thought that perhaps a sort would be more expensive than a second
call to match, and more so for two sorts. Cheers, Jonathan.
Jonathan Rougier Science Laboratories
Department of Mathematical Sciences South Road
University of Durham Durham DH1 3LE
"[B]egin upon the precept ... that the things we see are to be
weighed in the scale with what we know" (Meredith, 1879, The Egoist)
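
The single-match idea in the message above can be spelled out as follows. This is a sketch under stated assumptions: `equiv_match1` is a hypothetical name, an early return replaces the `&&`, and `seq_along(y)` stands in for `1:length(y)` on the assumption that zero-length input should be handled without relying on `1:0`.

```r
# Illustrative variant of the single-match idea; equiv_match1 is a
# hypothetical name, not a function from the thread or from base R.
equiv_match1 <- function(x, y) {
  x <- unique(x)
  y <- unique(y)
  if (length(x) != length(y)) return(FALSE)
  # After unique(), each x[i] matches at most one position in y, and no two
  # x[i] can share a position. If the sorted positions are exactly
  # 1..length(y), every element of each vector occurs in the other.
  all(sort(match(x, y, 0)) == seq_along(y))
}

equiv_match1(c("b", "a", "a"), c("a", "b"))  # same elements
equiv_match1(1:4, c(1:3, 9))                 # 4 has no match in y
equiv_match1(integer(0), integer(0))         # two empty sets
```

The trade-off raised in the message still applies: this version makes one call to `match` and one `sort` of integer positions, against two sorts of the data in the sort-based version.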
Jonathan Rougier <J.C.Rougier@durham.ac.uk> writes:
> > equiv<-function(x,y)
> >   length(x<-unique(x))==length(y<-unique(y)) &&
> >   all(sort(x)==sort(y))
>
> Yes, I wondered about that, and also about
>
> "equiv" <-
> function(x, y) {
>   x <- unique(x)
>   y <- unique(y)
>   length(x)==length(y) && all(1:length(y) == sort(match(x, y, 0)))
> }
>
> but I thought that perhaps a sort would be more expensive than a second
> call to match, and more so for two sorts. Cheers, Jonathan.

Watch:

> x<-1:50000
> y<-x[order(runif(50000))]
> "equiv2" <- function(x, y) all(c(match(x, y, 0)>0, match(y, x, 0)>0))
> equiv<-function(x,y)
+ length(x<-unique(x))==length(y<-unique(y)) &&
+ all(sort(x)==sort(y))
> system.time(equiv2(x,y))
[1] 3.10 0.02 3.00 0.00 0.00
> system.time(equiv(x,y))
[1] 0.77 0.00 1.00 0.00 0.00

   O__  ---- Peter Dalgaard             Blegdamsvej 3
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk)             FAX: (+45) 35327907
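
The comparison above can be repeated as a script. This is only a sketch: the names `equiv_match` and `equiv_sort`, the seed, and the use of `sample()` to build the permutation are assumptions made here, and no timings are quoted because they depend on the machine and on the R version. The ratio seen in 2000 may well not hold on a current R.

```r
# Re-run of the comparison as a script; names and seed are illustrative.
set.seed(1)          # arbitrary seed, for repeatability only
x <- 1:50000
y <- sample(x)       # a random permutation of x

# Two-way match version (called equiv2 in the session above).
equiv_match <- function(x, y) {
  all(c(match(x, y, 0) > 0, match(y, x, 0) > 0))
}

# Sort-based version (called equiv in the session above).
equiv_sort <- function(x, y) {
  length(x <- unique(x)) == length(y <- unique(y)) && all(sort(x) == sort(y))
}

t_match <- system.time(r_match <- equiv_match(x, y))
t_sort  <- system.time(r_sort  <- equiv_sort(x, y))

# Both versions should agree on the answer before timings are compared.
stopifnot(identical(r_match, r_sort))

# Elapsed seconds for each version.
print(c(match = t_match[["elapsed"]], sort = t_sort[["elapsed"]]))
```

A single run is a coarse measurement; repeating each call several times and comparing the spread would give a fairer picture.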
On 4 Jan 2000, Peter Dalgaard BSA wrote:
> Watch:
>
> > x<-1:50000
> > y<-x[order(runif(50000))]
> > "equiv2" <- function(x, y) all(c(match(x, y, 0)>0, match(y, x, 0)>0))
> > equiv<-function(x,y)
> + length(x<-unique(x))==length(y<-unique(y)) &&
> + all(sort(x)==sort(y))
> > system.time(equiv2(x,y))
> [1] 3.10 0.02 3.00 0.00 0.00
> > system.time(equiv(x,y))
> [1] 0.77 0.00 1.00 0.00 0.00

Yup -- that's much quicker! To re-ask the original question, would it be
reasonable to include such a function along with the other set functions?

Cheers, Jonathan.

Jonathan Rougier Science Laboratories
Department of Mathematical Sciences South Road
University of Durham Durham DH1 3LE
"[B]egin upon the precept ... that the things we see are to be
weighed in the scale with what we know" (Meredith, 1879, The Egoist)
On 4 Jan 2000, Peter Dalgaard BSA wrote:
> Watch:
>
> > x<-1:50000
> > y<-x[order(runif(50000))]
> > "equiv2" <- function(x, y) all(c(match(x, y, 0)>0, match(y, x, 0)>0))
> > equiv<-function(x,y)
> + length(x<-unique(x))==length(y<-unique(y)) &&
> + all(sort(x)==sort(y))
> > system.time(equiv2(x,y))
> [1] 3.10 0.02 3.00 0.00 0.00
> > system.time(equiv(x,y))
> [1] 0.77 0.00 1.00 0.00 0.00
JonR> Yup -- that's much quicker! To re-ask the original question,
JonR> would it be reasonable to include such a function along with the
JonR> other set functions? Cheers, Jonathan.
quite a good idea, particularly, since we all have now learned that it is
non-trivial to write really efficiently.
However, I think "equiv" is not specific enough (could mean "equivalence of
arbitrary R objects").
Wouldn't "setequiv" or "setequal" be better ?
((and would you provide (to R-core) patches to
src/library/base/R/sets.R and src/library/base/man/sets.Rd))
Martin
On Wed, 5 Jan 2000, Martin Maechler wrote:
> On 4 Jan 2000, Peter Dalgaard BSA wrote:
>
> > Watch:
> >
> > > x<-1:50000
> > > y<-x[order(runif(50000))]
> > > "equiv2" <- function(x, y) all(c(match(x, y, 0)>0, match(y, x, 0)>0))
> > > equiv<-function(x,y)
> > + length(x<-unique(x))==length(y<-unique(y)) &&
> > + all(sort(x)==sort(y))
> > > system.time(equiv2(x,y))
> > [1] 3.10 0.02 3.00 0.00 0.00
> > > system.time(equiv(x,y))
> > [1] 0.77 0.00 1.00 0.00 0.00
>
> JonR> Yup -- that's much quicker! To re-ask the original question,
> JonR> would it be reasonable to include such a function along with the
> JonR> other set functions? Cheers, Jonathan.
>
> quite a good idea, particularly, since we all have now learned that it is
> non-trivial to write really efficiently.

Some of us knew that.

What worries me a bit is that optimizing code for the current R may not be
a good idea. R currently spends a lot of its time on garbage collection
(30 to 50% on my profiling) and it is planned to alter the memory
allocator real soon now. When hashing of environments was introduced it
made a lot of difference to some code, and little to others.

That's not to say that we should not optimize, but trying hard may be a
waste of time. (Says he having learnt the hard way across S-PLUS versions.)

Brian

Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
On Wed, 5 Jan 2000, Martin Maechler wrote:
> quite a good idea, particularly, since we all have now learned that it is
> non-trivial to write really efficiently.

And I've just noticed that Peter's ultra-quick sorting algorithm stumbles
over NAs:

"setequal" <- function(x,y)
  length(x<-unique(x))==length(y<-unique(y)) &&
  all(sort(x)==sort(y))

"setequal2" <- function(x, y) all(c(match(x, y, 0)>0, match(y, x, 0)>0))

setequal(c(NA, 1:4), c(1:4, NA))   # TRUE
setequal2(c(NA, 1:4), c(1:4, NA))  # TRUE

setequal(c(NA, 1:4), c(1:4, 5))    # FALSE plus warning message
setequal2(c(NA, 1:4), c(1:4, 5))   # FALSE

Putting na.last=TRUE in sort does not help, as then there is a missing
logical for the && following the call to all.

Might I suggest, in the light of Brian's comments, that setequal2 is more
in the spirit of the other set functions?

Cheers, Jonathan.

Jonathan Rougier Science Laboratories
Department of Mathematical Sciences South Road
University of Durham Durham DH1 3LE
"[B]egin upon the precept ... that the things we see are to be
weighed in the scale with what we know" (Meredith, 1879, The Egoist)
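
The NA problem comes from `sort()` dropping missing values by default, so the two sorted vectors can end up with different lengths. The sketch below first shows a case where the sort-based version goes further and returns a wrong answer without any warning, then gives one possible NA-aware variant. The names `setequal_sort` and `setequal_sort_na` are hypothetical, `anyNA()` assumes R 3.1.0 or later, and treating NA and NaN alike is a simplifying assumption that differs from how `match()` behaves.

```r
# The sort-based version from the thread, renamed for this sketch.
setequal_sort <- function(x, y) {
  length(x <- unique(x)) == length(y <- unique(y)) && all(sort(x) == sort(y))
}

# sort(NA) has length zero, and all() of a zero-length comparison is TRUE,
# so a lone NA is reported as equal to any single value.
setequal_sort(NA, 5)

# Hypothetical NA-aware variant: compare the NA status separately, then
# compare the sorted non-missing values, checking their lengths first.
# Assumption: NA and NaN are treated alike here, unlike match().
setequal_sort_na <- function(x, y) {
  x <- unique(x)
  y <- unique(y)
  if (length(x) != length(y) || anyNA(x) != anyNA(y)) return(FALSE)
  sx <- sort(x)   # missing values are dropped by default
  sy <- sort(y)
  length(sx) == length(sy) && all(sx == sy)
}

setequal_sort_na(NA, 5)                    # NA status differs
setequal_sort_na(c(NA, 1:4), c(1:4, NA))   # same values, NA on both sides
setequal_sort_na(c(NA, 1:4), c(1:4, 5))    # NA on one side only
```

The extra checks cost little, but they also show how much care the sort-based route needs, which is consistent with the preference for the match-based version expressed above.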
Hi Martin,
> However, I think "equiv" is not specific enough (could mean "equivalence of
> arbitrary R objects").
> Wouldn't "setequiv" or "setequal" be better ?
> ((and would you provide (to R-core) patches to
> src/library/base/R/sets.R and src/library/base/man/sets.Rd))

I think, after all of the discussion, particularly Brian's helpful
interventions, the original function prevails, although Peter's suggested
sorting function was very instructive.
"setequal" <- function(x, y) all(c(match(x, y, 0)>0, match(y, x, 0)>0))
The help file (src/library/base/man/sets.Rd) needs the following modifications:
\alias{setequal}
\description{Performs set union, intersection, difference, equality and
membership on two vectors.}
\usage{
union(x, y)
intersect(x, y)
setdiff(x, y)
setequal(x, y)
is.element(x, y)
}
%% There appears to be an extra tab or other white space in the arguments
%% field.
\examples{
x <- sample(1:20, 10)
y <- sample(3:23, 7)
union(x, y)
intersect(x, y)
setdiff(x, y)
setequal(x, y)
is.element(y, x)
}
Cheers, Jonathan.
Jonathan Rougier Science Laboratories
Department of Mathematical Sciences South Road
University of Durham Durham DH1 3LE
"[B]egin upon the precept ... that the things we see are to be
weighed in the scale with what we know" (Meredith, 1879, The Egoist)
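
A function named `setequal` is available in base R today, so the proposed usage can be tried directly. The sketch below is illustrative only; the example vectors are made up here and nothing is claimed about how the current base implementation is written.

```r
# Example vectors chosen for illustration.
x <- c(3, 1, 2, 2)
y <- 1:3

setequal(x, y)                # same elements; order and duplicates ignored
setequal(x, 1:4)              # 4 is not in x
setequal(c(NA, 1), c(1, NA))  # NA is matched like any other value
```

In the \examples block proposed above, `x` holds 10 distinct values and `y` holds 7, so `setequal(x, y)` can only return FALSE there. A second call on two orderings of the same values would also show a TRUE result.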