set functions - R-devel | R Mailing Lists

Tue, Jan 4, 2000 7:47 AM

I wonder if we might also include an "equiv" function along with the other
set functions (ie "union", "intersect", etc), perhaps along the lines of

"equiv" <- function(x, y) all(c(match(x, y, 0)>0, match(y, x, 0)>0))

(which I think might be the quickest implementation).  I use this type of
function quite frequently: is there some reason why it is not in the base?
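As a minimal sketch of what the proposed function does (the example vectors are illustrative and not from the original message): it treats both arguments as sets, so neither element order nor duplicates affect the result.

```r
# equiv() as proposed above: TRUE when every element of x occurs in y
# and every element of y occurs in x.
equiv <- function(x, y) all(c(match(x, y, 0) > 0, match(y, x, 0) > 0))

r_order <- equiv(c(3, 1, 2), c(1, 2, 3))  # same elements, different order: TRUE
r_dups  <- equiv(c(1, 1, 2), c(2, 1))     # duplicates are ignored: TRUE
r_diff  <- equiv(1:4, 1:5)                # 5 has no match in 1:4: FALSE
```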

Cheers, Jonathan.

Jonathan Rougier                       Science Laboratories
Department of Mathematical Sciences    South Road
University of Durham                   Durham DH1 3LE

"[B]egin upon the precept ... that the things we see are to be 
 weighed in the scale with what we know"  (Meredith, 1879, The Egoist)


-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

Peter Dalgaard

Tue, Jan 4, 2000 8:16 AM

Jonathan Rougier <J.C.Rougier@durham.ac.uk> writes:

length(setdiff(x,y))==0 appears to be about twice as fast....

O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk)             FAX: (+45) 35327907

Jonathan Rougier

Tue, Jan 4, 2000 8:17 AM

On 4 Jan 2000, Peter Dalgaard BSA wrote:

But I don't think that would be right!

length(setdiff(1:4, 1:5))==0	# is TRUE
equiv(1:4, 1:5)			# clearly FALSE
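A short sketch of why the one-sided test is not enough: setdiff(x, y) only reports elements of x that are missing from y, so it says nothing about elements of y that are missing from x. The variable names below are illustrative.

```r
x <- 1:4
y <- 1:5

only_in_x <- setdiff(x, y)   # integer(0): every element of x is in y
only_in_y <- setdiff(y, x)   # 5: y has an element that x lacks

one_sided <- length(only_in_x) == 0   # TRUE, although the sets differ
```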

Jonathan.

Jonathan Rougier                       Science Laboratories
Department of Mathematical Sciences    South Road
University of Durham                   Durham DH1 3LE

"[B]egin upon the precept ... that the things we see are to be 
 weighed in the scale with what we know"  (Meredith, 1879, The Egoist)


Peter Dalgaard

Tue, Jan 4, 2000 8:39 AM

Jonathan Rougier <J.C.Rougier@durham.ac.uk> writes:

Argh. I was thinking of the symmetric set difference. So you'd need 
length(setdiff(y,x))==0 && length(setdiff(x,y))==0, which is obviously
only half as fast as twice as fast....

However:

equiv<-function(x,y) 
    length(x<-unique(x))==length(y<-unique(y)) && 
    all(sort(x)==sort(y))
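The two approaches in this message can be compared side by side. This is only a sketch: the helper names equiv_setdiff and equiv_sort and the test cases are hypothetical, chosen for illustration, and the inputs contain no NAs.

```r
# Two-sided set difference check (hypothetical helper name).
equiv_setdiff <- function(x, y) {
    length(setdiff(x, y)) == 0 && length(setdiff(y, x)) == 0
}

# Sort-based version from the message above (renamed for the comparison).
equiv_sort <- function(x, y) {
    length(x <- unique(x)) == length(y <- unique(y)) &&
        all(sort(x) == sort(y))
}

# Illustrative NA-free inputs: unequal sets, reordered sets, duplicates.
cases <- list(
    list(1:4, 1:5),
    list(c(3, 1, 2), c(1, 2, 3)),
    list(c(1, 1, 2), c(2, 1))
)

res_setdiff <- sapply(cases, function(cs) equiv_setdiff(cs[[1]], cs[[2]]))
res_sort    <- sapply(cases, function(cs) equiv_sort(cs[[1]], cs[[2]]))
```

On these inputs both helpers give the same answers (FALSE, TRUE, TRUE).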

O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk)             FAX: (+45) 35327907

Jonathan Rougier

Tue, Jan 4, 2000 8:45 AM

On 4 Jan 2000, Peter Dalgaard BSA wrote:

Yes, I wondered about that, and also about

"equiv" <-
function(x, y) {
  x <- unique(x)
  y <- unique(y)
  length(x)==length(y) && all(1:length(y) == sort(match(x, y, 0)))
}

but I thought that perhaps a sort would be more expensive than a second
call to match, and more so for two sorts.  Cheers, Jonathan.
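To make the single-match variant easier to follow, here is a sketch of its intermediate values. The character vectors are illustrative, and seq_along(y) is used in place of 1:length(y) as an assumption of this sketch. After unique(), the positions returned by match() must be a permutation of 1..length(y) for the sets to be equal.

```r
x <- unique(c("b", "a", "c", "a"))   # "b" "a" "c"
y <- unique(c("a", "b", "c"))        # "a" "b" "c"

pos <- match(x, y, 0)                # position of each x element in y: 2 1 3
same <- length(x) == length(y) && all(seq_along(y) == sort(pos))   # TRUE

# An element of x with no match in y yields a 0, which breaks the permutation.
z <- c("a", "b", "d")
pos_z <- match(z, y, 0)              # 1 2 0
same_z <- length(z) == length(y) && all(seq_along(y) == sort(pos_z))   # FALSE
```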

Jonathan Rougier                       Science Laboratories
Department of Mathematical Sciences    South Road
University of Durham                   Durham DH1 3LE

"[B]egin upon the precept ... that the things we see are to be 
 weighed in the scale with what we know"  (Meredith, 1879, The Egoist)



Peter Dalgaard

Tue, Jan 4, 2000 8:55 AM

Jonathan Rougier <J.C.Rougier@durham.ac.uk> writes:

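Watch:

> x<-1:50000
> y<-x[order(runif(50000))]
> "equiv2" <- function(x, y) all(c(match(x, y, 0)>0, match(y, x, 0)>0))
> equiv<-function(x,y) 
+     length(x<-unique(x))==length(y<-unique(y)) && 
+     all(sort(x)==sort(y))
> system.time(equiv2(x,y))
[1] 3.10 0.02 3.00 0.00 0.00
> system.time(equiv(x,y))
[1] 0.77 0.00 1.00 0.00 0.00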

O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk)             FAX: (+45) 35327907

Jonathan Rougier

Tue, Jan 4, 2000 8:59 AM

On 4 Jan 2000, Peter Dalgaard BSA wrote:

Yup -- that's much quicker!  To re-ask the original question, would it be
reasonable to include such a function along with the other set functions?
Cheers, Jonathan.

Jonathan Rougier                       Science Laboratories
Department of Mathematical Sciences    South Road
University of Durham                   Durham DH1 3LE

"[B]egin upon the precept ... that the things we see are to be 
 weighed in the scale with what we know"  (Meredith, 1879, The Egoist)


Martin Maechler

Wed, Jan 5, 2000 1:50 AM

On 4 Jan 2000, Peter Dalgaard BSA wrote:

 > Watch:
 > 
 > > x<-1:50000
 > > y<-x[order(runif(50000))]
 > > "equiv2" <- function(x, y) all(c(match(x, y, 0)>0, match(y, x, 0)>0))
 > > equiv<-function(x,y) 
 > +     length(x<-unique(x))==length(y<-unique(y)) && 
 > +     all(sort(x)==sort(y)) 
 > > system.time(equiv2(x,y))
 > [1] 3.10 0.02 3.00 0.00 0.00
 > > system.time(equiv(x,y))
 > [1] 0.77 0.00 1.00 0.00 0.00

    JonR> Yup -- that's much quicker!  To re-ask the original question,
    JonR> would it be reasonable to include such a function along with the
    JonR> other set functions?  Cheers, Jonathan.

Quite a good idea, particularly since we have all now learned that it is
non-trivial to write such a function really efficiently.

However, I think "equiv" is not specific enough (it could mean "equivalence of
arbitrary R objects").
Wouldn't "setequiv" or "setequal" be better?

(And would you provide (to R-core) patches to
    src/library/base/R/sets.R and src/library/base/man/sets.Rd?)

Martin

Brian Ripley

Wed, Jan 5, 2000 2:21 AM

On Wed, 5 Jan 2000, Martin Maechler wrote:

Some of us knew that. What worries me a bit is that optimizing code for the
current R may not be a good idea. R currently spends a lot of its
time on garbage collection (30 to 50% on my profiling) and it is planned to 
alter the memory allocator real soon now.  When hashing of environments
was introduced it made a lot of difference to some code, and little to
others.  That's not to say that we should not optimize, but
trying hard may be a waste of time. (Says he having learnt the hard way
across S-PLUS versions.)

Brian

Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


Jonathan Rougier

Wed, Jan 5, 2000 4:10 AM

On Wed, 5 Jan 2000, Martin Maechler wrote:

And I've just noticed that Peter's ultra-quick sorting algorithm stumbles
over NAs:

"setequal" <- function(x,y)
  length(x<-unique(x))==length(y<-unique(y)) && all(sort(x)==sort(y))

"setequal2" <- function(x, y) all(c(match(x, y, 0)>0, match(y, x, 0)>0))

setequal(c(NA, 1:4), c(1:4, NA))	# TRUE
setequal2(c(NA, 1:4), c(1:4, NA))	# TRUE

setequal(c(NA, 1:4), c(1:4, 5))		# FALSE plus warning message
setequal2(c(NA, 1:4), c(1:4, 5))	# FALSE

Putting na.last=TRUE in sort does not help, as then there is a missing
logical for the && following the call to all.  Might I suggest, in the
light of Brian's comments, that setequal2 is more in the spirit of the
other set functions?
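The NA behaviour described above can be reproduced with a small sketch. The helper names setequal_sort and setequal_match and the na.last argument on the helper are assumptions added here so that both sort settings can be tried. It relies on sort() dropping NAs when na.last = NA (the default) and keeping them at the end when na.last = TRUE.

```r
# Sort-based version, with na.last exposed (hypothetical helper).
setequal_sort <- function(x, y, na.last = NA) {
    length(x <- unique(x)) == length(y <- unique(y)) &&
        all(sort(x, na.last = na.last) == sort(y, na.last = na.last))
}

# Match-based version (hypothetical helper name).
setequal_match <- function(x, y) all(c(match(x, y, 0) > 0, match(y, x, 0) > 0))

a <- c(NA, 1:4)
b <- c(1:4, 5)

# NA dropped: vectors of length 4 and 5 are compared with recycling,
# which gives FALSE together with a warning.
r_drop <- suppressWarnings(setequal_sort(a, b))

# NA kept: the comparison contains an NA, so all() and hence && return NA.
r_keep <- setequal_sort(a, b, na.last = TRUE)

# match() treats NA as an ordinary value, so the answer is a plain FALSE.
r_match <- setequal_match(a, b)
```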

Cheers, Jonathan.

Jonathan Rougier                       Science Laboratories
Department of Mathematical Sciences    South Road
University of Durham                   Durham DH1 3LE

"[B]egin upon the precept ... that the things we see are to be 
 weighed in the scale with what we know"  (Meredith, 1879, The Egoist)


Jonathan Rougier

Thu, Jan 6, 2000 1:19 AM

Hi Martin,

I think, after all of the discussion, particularly Brian's helpful
interventions, the original function prevails, although Peter's suggested
sorting function was very instructive.

"setequal" <- function(x, y) all(c(match(x, y, 0)>0, match(y, x, 0)>0))
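Current versions of R ship a setequal() in base, so the one-liner above can be checked against it. In this sketch the thread's definition is given the hypothetical name setequal_thread to avoid masking the base function, and the test cases are illustrative.

```r
# The definition from this message, under a non-masking (hypothetical) name.
setequal_thread <- function(x, y) all(c(match(x, y, 0) > 0, match(y, x, 0) > 0))

# Illustrative inputs: reordered NA, missing NA, duplicates, empty sets.
cases <- list(
    list(c(NA, 1:4), c(1:4, NA)),
    list(c(NA, 1:4), c(1:4, 5)),
    list(c(1, 1, 2), c(2, 1)),
    list(integer(0), integer(0))
)

from_thread <- sapply(cases, function(cs) setequal_thread(cs[[1]], cs[[2]]))
from_base   <- sapply(cases, function(cs) base::setequal(cs[[1]], cs[[2]]))
```

On these inputs the two agree, returning TRUE, FALSE, TRUE, TRUE.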

The help file needs the following modifications:

\alias{setequal}

\description{Performs set union, intersection, difference, equality and
membership on two vectors.}

\usage{
union(x, y)
intersect(x, y)
setdiff(x, y)
setequal(x, y)
is.element(x, y)
}

%% There appears to be an extra tab or other white space in the arguments
%% field.

\examples{
x <- sample(1:20, 10)
y <- sample(3:23, 7)
union(x, y)
intersect(x, y)
setdiff(x, y)
setequal(x, y)
is.element(y, x)
}

Cheers, Jonathan.

Jonathan Rougier                       Science Laboratories
Department of Mathematical Sciences    South Road
University of Durham                   Durham DH1 3LE

"[B]egin upon the precept ... that the things we see are to be 
 weighed in the scale with what we know"  (Meredith, 1879, The Egoist)
