Duplicates among columns of a data frame

3 messages · Andrew C. Ward, Brian Ripley, Charles C. Berry

#
Dear list,

I have a data frame of survey respondents, a little like this:

set.seed(20081215)
n <- 100
dat <- data.frame(id=1:n,
                   addr1=sample(LETTERS, n, replace=TRUE),
                   addr2=sample(LETTERS, n, replace=TRUE),
                   addr3=sample(LETTERS, n, replace=TRUE))
head(dat, 5)

   id addr1 addr2 addr3
1  1     R     H     Q
2  2     H     C     K
3  3     I     P     S
4  4     A     H     L
5  5     P     Q     P



I wish to detect potential duplicates in the data frame.
In my example, people can have up to three addresses.
If two people have the same address, then there is a
chance that the two entries are duplicates (for instance,
persons 1, 2, and 4 in the sample data have the same
entry "H" so I want to be sure they aren't duplicates).
Person 5 has the same address "P" for both addr1 and addr3,
but that is not a duplicate, since one person may list the
same address in more than one field.
I'm only concerned about multiple people sharing the
same address.

It's easy to find duplicates within individual columns, but
I'm not sure how to do so across columns. Any advice you
have would be more than welcome. Thanks!
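
For reference, a minimal sketch of the within-column case mentioned above. The vector `addr1` is a hypothetical example (the first five values shown in the post plus one extra "H" added for illustration), not the full simulated data:

```r
# Hypothetical single address column; the trailing "H" is added for illustration
addr1 <- c("R", "H", "I", "A", "P", "H")

# Flag every element whose value occurs more than once in this column:
# duplicated() alone marks only the later occurrences, so also scan from the end
dup <- duplicated(addr1) | duplicated(addr1, fromLast = TRUE)
which(dup)
```

This only sees repeats inside one column, which is why it does not help when the same address sits in addr1 for one person and addr2 for another.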


Regards,

Andrew C. Ward

CAPE Centre
Department of Chemical Engineering
The University of Queensland
Brisbane Qld 4072 Australia
#
I think you mean duplicated *rows*, not columns, despite your subject 
line.

See ?duplicated, which has a data.frame method.
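
A minimal sketch of what that method does, assuming a small hypothetical data frame `toy` that is not part of the original post:

```r
# Hypothetical toy data, not from the original post
toy <- data.frame(id    = c(1, 2, 2, 3),
                  addr1 = c("R", "H", "H", "I"),
                  stringsAsFactors = FALSE)

# On a data frame, duplicated() flags rows identical to an earlier row
dup.rows <- duplicated(toy)
dup.rows

# Keep only the first occurrence of each row
toy[!dup.rows, ]
```

Because whole rows are compared, two different ids that merely share an address in different columns would not be flagged this way.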
On Mon, 15 Dec 2008, Andrew C. Ward wrote:

#
Andrew,

Is this what you seek?


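# Every distinct address that appears in any of the address columns
all.addresses <- Reduce(union, dat[-1])
# For each address, the ids of everyone who lists it in at least one column
who.is.here <- sapply(all.addresses,
                      function(x) dat$id[rowSums(dat[-1] == x) != 0],
                      simplify = FALSE)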
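
A possible follow-up, sketched on just the five rows printed in the original post (so the result covers those rows only, not the full simulated data). The name `shared` is an assumption introduced here for illustration:

```r
# The five rows shown in the original post
dat <- data.frame(id    = 1:5,
                  addr1 = c("R", "H", "I", "A", "P"),
                  addr2 = c("H", "C", "P", "H", "Q"),
                  addr3 = c("Q", "K", "S", "L", "P"),
                  stringsAsFactors = FALSE)

all.addresses <- Reduce(union, dat[-1])
who.is.here <- sapply(all.addresses,
                      function(x) dat$id[rowSums(dat[-1] == x) != 0],
                      simplify = FALSE)

# Keep only the addresses listed by more than one person
shared <- who.is.here[lengths(who.is.here) > 1]
shared
```

On these five rows, `shared` should hold "H" (ids 1, 2, 4), "P" (ids 3, 5) and "Q" (ids 1, 5). Person 5 appears only once under "P" even though that address is in two of their columns, because each row is counted once per address.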


If not, try to give us more detail.

HTH,

Chuck
Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901