Should one of the suggestions be implemented as the unique method for
data.frame? Or maybe uniquerows.data.frame? Just a thought... This is
probably nearly a FAQ.

Andy

-----Original Message-----
From: Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk]
Sent: Wednesday, October 31, 2001 6:54 AM
To: Peter Dalgaard BSA
Cc: Gary Collins; r-help
Subject: Re: [R] removing duplicated rows from a data.frame

On 31 Oct 2001, Peter Dalgaard BSA wrote:
"Gary Collins" <gco at eortc.be> writes:
Dear all,

Sorry for the simplicity of the question, but how does one go about removing duplicated rows in a data.frame? I'm looking for a quick and simple solution, as my data.frames are relatively large (50000 by 50). I've racked my brain and searched the help files and found nothing useful or quick, only duplicated() and unique(), which only work on lists.
Nontrivial I think. Something like
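# NA-aware equality: two NAs compare equal, NA vs. non-NA compares unequal
eql <- function(x, y) ifelse(is.na(x), is.na(y), ifelse(is.na(y), FALSE, x == y))
# permutation that sorts the rows of dfr
o <- do.call("order", dfr)
# compare each sorted row with its successor, column by column
isdup <- do.call("cbind", lapply(dfr[o, ], function(x) eql(x, c(x[-1], NA))))
# TRUE where every column matches the next row
all.dup <- apply(isdup, 1, all)
# map the flags back to the original row order
all.dup[o] <- all.dup
# keep the non-duplicated rows (note the comma: rows, not columns)
dfr[!all.dup, ]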
i.e. sort the data frame, then figure out which rows have all values
identical to their successor. This gives a logical vector, but in the
order of the sorted values, so reorder it. Finally select the nondups. As
a "bonus feature", I think this will also remove any row containing all
NA's...
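
For illustration, the sort-and-compare idea can be traced on a small made-up data frame. This is only a sketch: the toy data, the names sorted, dup_sorted and dup, and the unname() call (used so that column names cannot clash with order()'s own arguments) are assumptions, not part of the suggestion above.

```r
# Hypothetical toy data: rows 1 and 3 are identical, as are rows 2 and 4
dfr <- data.frame(a = c(1, 2, 1, 2, 3),
                  b = c("x", "y", "x", "y", "z"),
                  stringsAsFactors = FALSE)

# NA-aware equality, as in the post
eql <- function(x, y) ifelse(is.na(x), is.na(y), ifelse(is.na(y), FALSE, x == y))

# Sort the rows so that identical rows become neighbours
o <- do.call("order", unname(as.list(dfr)))
sorted <- dfr[o, , drop = FALSE]

# Column by column, compare each sorted row with its successor
isdup <- do.call("cbind", lapply(sorted, function(x) eql(x, c(x[-1], NA))))

# A row is a duplicate if every column matches the next row
dup_sorted <- apply(isdup, 1, all)

# Put the flags back into the original row order
dup <- logical(nrow(dfr))
dup[o] <- dup_sorted

res <- dfr[!dup, ]
print(res)
```

A row is flagged when its successor in sorted order is identical, so within each group of identical rows the copy that sorts last is the one kept.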
A major stumbling block is that you'll want two NAs to compare equal,
hence the eql() function.
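
A minimal check of why plain == is not enough here, using made-up vectors (x, y, plain and aware are illustrative names only):

```r
# NA-aware equality, as in the post
eql <- function(x, y) ifelse(is.na(x), is.na(y), ifelse(is.na(y), FALSE, x == y))

x <- c(1, NA, NA, 2)
y <- c(1, NA, 3, NA)

plain <- x == y      # TRUE NA NA NA: any comparison involving NA is NA
aware <- eql(x, y)   # TRUE TRUE FALSE FALSE: NA matches only NA

print(plain)
print(aware)
```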
Actually, I think you can do away with the isdup array and do
all.dup <- do.call("pmin",lapply(dfr[o,],function(x)eql(x,c(x[-1],NA))))
and there may be further cleanups possible.
One dirty trick which is much quicker but not quite as reliable is
duplicated(do.call("paste",dfr))
(watch out for character strings with embedded spaces and underflowing
differences in numeric data!)
merge.data.frame does the equivalent of
mypaste <- function(...) paste(..., sep="\r")
do.call("mypaste", dfr)
which seems reliable enough. Identical numerical data should
as.character identically, and embedded CRs are very rare in R character
strings.
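
The embedded-space caveat can be seen on a small made-up data frame. The data and the names key_space and key_cr are assumptions for illustration; the sketch only contrasts the default separator with "\r".

```r
# Hypothetical data: the two rows differ, but paste to the same string
# when the separator is a space
dfr <- data.frame(a = c("x y", "x"),
                  b = c("z", "y z"),
                  stringsAsFactors = FALSE)

key_space <- do.call("paste", dfr)                 # default sep = " "
key_cr    <- do.call("paste", c(dfr, sep = "\r"))  # separator unlikely to occur in data

dup_space <- duplicated(key_space)  # second row wrongly flagged as a duplicate
dup_cr    <- duplicated(key_cr)     # no row flagged

print(dup_space)
print(dup_cr)
```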
As a test
data(iris)
duplicated(do.call("mypaste", iris))
(or duplicated(do.call("paste", c(iris, sep="\r"))) if you prefer a
one-liner).
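
To go from the logical vector to a de-duplicated data frame, one possible follow-up is to index the rows with its negation. This is a sketch; dup and iris_unique are illustrative names, not from the thread.

```r
data(iris)

# TRUE for every row that repeats an earlier row
dup <- duplicated(do.call("paste", c(iris, sep = "\r")))

# Keep the first occurrence of each row (the comma selects rows, not columns)
iris_unique <- iris[!dup, ]

print(sum(dup))
print(dim(iris_unique))
```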
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272860 (secr)
Oxford OX1 3TG, UK Fax: +44 1865 272595
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._