Hello,
I have been interested in setdiff() for data frames that operates
row-wise. I looked in the documentation, mailing lists, etc., and
didn't find exactly the right thing. Given data frames A, B with the
same columns, the goal is to extract the rows that are in A, but not
in B. Of course, one can usually do setdiff(rownames(A), rownames(B))
but that is cheating. :-)
I played around a little bit and came up with
setdiff.data.frame = function(A, B){
g <- function( y, B){
any( apply(B, 1, FUN = function(x)
identical(all.equal(x, y), TRUE) ) ) }
unique( A[ !apply(A, 1, FUN = function(t) g(t, B) ), ] )
}
I am sure that somebody can do this a better/faster way... any ideas?
Any chance we could get a data.frame method for set.diff in future R
versions? (The notion of "set" is somewhat ambiguous with respect to
rows, columns, and entries in the data frame case.)
Jay
P.S. You can see what I'm looking for with
A <- expand.grid( 1:3, 1:3 )
B <- A[ 2:5, ]
setdiff.data.frame(A,B)
***************************************************
G. Jay Kerns, Ph.D.
Assistant Professor / Statistics Coordinator
Department of Mathematics & Statistics
Youngstown State University
Youngstown, OH 44555-0002 USA
Office: 1035 Cushwa Hall
Phone: (330) 941-3310 Office (voice mail)
-3302 Department
-3170 FAX
E-mail: gkerns at ysu.edu
http://www.cc.ysu.edu/~gjkerns/
setdiff for data frames
4 messages · Charles C. Berry, G. Jay Kerns, Brian Ripley
On Mon, 10 Dec 2007, G. Jay Kerns wrote:
Hello,
I have been interested in setdiff() for data frames that operates
row-wise. I looked in the documentation, mailing lists, etc., and
didn't find exactly the right thing. Given data frames A, B with the
same columns, the goal is to extract the rows that are in A, but not
in B. Of course, one can usually do setdiff(rownames(A), rownames(B))
but that is cheating. :-)
I played around a little bit and came up with
setdiff.data.frame = function(A, B){
g <- function( y, B){
any( apply(B, 1, FUN = function(x)
identical(all.equal(x, y), TRUE) ) ) }
unique( A[ !apply(A, 1, FUN = function(t) g(t, B) ), ] )
}
I am sure that somebody can do this a better/faster way... any ideas?
setdiff.data.frame <-
function(A,B) A[ !duplicated( rbind(B,A) )[ -seq_len(nrow(B))] , ]
This ignores rownames(A) which may not be what is wanted in every case.
HTH,
Chuck
Any chance we could get a data.frame method for set.diff in future R versions? (The notion of "set" is somewhat ambiguous with respect to rows, columns, and entries in the data frame case.) Jay P.S. You can see what I'm looking for with A <- expand.grid( 1:3, 1:3 ) B <- A[ 2:5, ] setdiff.data.frame(A,B) *************************************************** G. Jay Kerns, Ph.D. Assistant Professor / Statistics Coordinator Department of Mathematics & Statistics Youngstown State University Youngstown, OH 44555-0002 USA Office: 1035 Cushwa Hall Phone: (330) 941-3310 Office (voice mail) -3302 Department -3170 FAX E-mail: gkerns at ysu.edu http://www.cc.ysu.edu/~gjkerns/
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Charles C. Berry (858) 534-2098
Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901
An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/r-devel/attachments/20071210/4b27a6bd/attachment.pl
On Mon, 10 Dec 2007, Charles C. Berry wrote:
On Mon, 10 Dec 2007, G. Jay Kerns wrote:
Hello,
I have been interested in setdiff() for data frames that operates
row-wise. I looked in the documentation, mailing lists, etc., and
didn't find exactly the right thing. Given data frames A, B with the
same columns, the goal is to extract the rows that are in A, but not
in B. Of course, one can usually do setdiff(rownames(A), rownames(B))
but that is cheating. :-)
I played around a little bit and came up with
setdiff.data.frame = function(A, B){
g <- function( y, B){
any( apply(B, 1, FUN = function(x)
identical(all.equal(x, y), TRUE) ) ) }
unique( A[ !apply(A, 1, FUN = function(t) g(t, B) ), ] )
}
I am sure that somebody can do this a better/faster way... any ideas?
setdiff.data.frame <- function(A,B) A[ !duplicated( rbind(B,A) )[ -seq_len(nrow(B))] , ] This ignores rownames(A) which may not be what is wanted in every case.
I was about to suggest using the approach taken by duplicated.data.frame,
(which is to 'hash' the rows to a character vector) then call setdiff.
E.g.
a <- do.call("paste", c(A, sep = "\r"))
b <- do.call("paste", c(B, sep = "\r"))
A[match(setdiff(a, b),a), ]
Note that apply() is intended for matrices (not data frames) and the
version given can do a horrendous amount of coercion, whereas the above
does it only once.
HTH, Chuck
Any chance we could get a data.frame method for set.diff in future R versions? (The notion of "set" is somewhat ambiguous with respect to rows, columns, and entries in the data frame case.)
No chance: if you have not found it in the archives, it is too rare a request.
Jay P.S. You can see what I'm looking for with A <- expand.grid( 1:3, 1:3 ) B <- A[ 2:5, ] setdiff.data.frame(A,B)
Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595