I would like to know which rows are duplicates of each other, not
simply that a row is a duplicate of another row. In the following
example rows 1 and 3 are duplicates.
> x <- c(1,3,1)
> y <- c(2,4,2)
> z <- c(3,4,3)
> data <- data.frame(x,y,z)
> data
  x y z
1 1 2 3
2 3 4 4
3 1 2 3
I can't figure out how to get R to tell me that observation 1 and 3
are the same. It seems like the "duplicated" and "unique" functions
should be able to help me out, but I am stumped.
For instance, if I use "duplicated" ...
> duplicated(data)
[1] FALSE FALSE TRUE
it tells me that row 3 is a duplicate, but not which row it matches.
How do I figure out WHICH row it matches?
And if I use "unique"...
> unique(data)
  x y z
1 1 2 3
2 3 4 4
I see that rows 1 and 2 are unique, leaving me to infer that row 3 was
a duplicate, but again it doesn't tell me which row it was a duplicate
of (as far as I can tell). Am I missing something?
How can I determine that row 3 is a duplicate OF ROW 1?
Thanks,
Aaron
which rows are duplicates?
9 messages · Aaron M. Swoboda, Bill Venables, Michael Dewey +2 more
If you sort the data then the duplicated entries will occur in consecutive blocks:
> m
  x y z
1 1 2 3
2 3 4 4
3 1 2 3
> m1 <- m[do.call(order, m), ]
> m1
  x y z
1 1 2 3
3 1 2 3
2 3 4 4
> duplicated(m1)
[1] FALSE  TRUE FALSE
When you identify the blocks, the row names will tell you where they occur in the original data frame.
Bill Venables
http://www.cmis.csiro.au/bill.venables/
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Aaron M. Swoboda
Sent: Monday, 30 March 2009 2:07 PM
To: r-help at r-project.org
Subject: [R] which rows are duplicates?
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
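A minimal sketch of the block-identification step described in the reply above, assuming the example data frame is rebuilt as `m`; the names `block` and `groups` are made up here and are not from the thread. Sorting puts identical rows next to each other, so a running count of the non-duplicated rows labels each block, and the row names kept by the subsetting point back to the original rows.

```r
# example data from the original post
m <- data.frame(x = c(1, 3, 1), y = c(2, 4, 2), z = c(3, 4, 3))

# sort by all columns so that identical rows become adjacent
m1 <- m[do.call(order, m), ]

# every non-duplicated row starts a new block; cumsum() numbers the blocks
block <- cumsum(!duplicated(m1))

# collect the original row names belonging to each block
groups <- split(rownames(m1), block)
groups
```

Here `groups` puts original rows 1 and 3 in the same block and row 2 in a block of its own.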
Does this do what you want?
> x <- c(1,3,1)
> y <- c(2,4,2)
> z <- c(3,4,3)
> data <- data.frame(x,y,z)
> data.u <- unique(data)
> data.u
  x y z
1 1 2 3
2 3 4 4
> data.u <- cbind(data.u, set = 1:nrow(data.u))
> merge(data, data.u)
  x y z set
1 1 2 3   1
2 1 2 3   1
3 3 4 4   2
You need to do a bit more work to get them back into the original row order if that is essential.
Michael Dewey http://www.aghmed.fsnet.co.uk
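One possible way to do the "bit more work" mentioned in the reply above is to record the original row position before merging and sort on it afterwards. This is only a sketch under that assumption; the helper column `row` and the result name `merged` are made up here.

```r
data <- data.frame(x = c(1, 3, 1), y = c(2, 4, 2), z = c(3, 4, 3))

# one row per distinct combination, with a set id
data.u <- unique(data)
data.u <- cbind(data.u, set = seq_len(nrow(data.u)))

# hypothetical helper column remembering the original row position
data$row <- seq_len(nrow(data))

# merge() joins on the shared columns x, y, z and sorts by them
merged <- merge(data, data.u)

# restore the original order using the helper column
merged <- merged[order(merged$row), ]
rownames(merged) <- NULL
merged
```

After the reordering, rows 1 and 3 carry the same `set` id and row 2 carries a different one.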
i don't have any solution significantly better than what you have
already been given. but i have a warning instead.
in this thread, both 'duplicated' and 'unique' are used on data frames, and
the solution proposed by michael relies on the latter. you may want to try to
avoid both when working with data frames; this is because of how they
do (or don't) work.
duplicated (and unique, which calls duplicated) simply pastes the
content of each row into a *string*, and then works on the strings.
this means that NAs in the data frame are converted to "NA"s, and "NA"
== "NA", obviously, so that rows that include NAs and are otherwise
identical will be considered *identical*.
that's not bad (yet), but you should be aware. however, duplicated has
a parameter named 'incomparables', explained in ?duplicated as follows:
"
incomparables: a vector of values that cannot be compared. 'FALSE' is a
special value, meaning that all values can be compared, and
may be the only value accepted for methods other than the
default. It will be coerced internally to the same type as
'x'.
"
and also
"
Values in 'incomparables' will never be marked as duplicated. This
is intended to be used for a fairly small set of values and will
not be efficient for a very large set.
"
that is, for example:
vector = c(NA, NA)
duplicated(vector)
# [1] FALSE TRUE
duplicated(vector, incomparables=NA)
# [1] FALSE FALSE
list = list(NA, NA)
duplicated(list)
# [1] FALSE TRUE
duplicated(list, incomparables=NA)
# [1] FALSE FALSE
what the documentation *fails* to tell you is that the parameter
'incomparables' is defunct in duplicated.data.frame, which you can see
in its source code (below), or in the following example:
# data as above, or any data frame
duplicated(data, incomparables=NA)
# Error in if (!is.logical(incomparables) || incomparables)
#   .NotYetUsed("incomparables != FALSE") :
#   missing value where TRUE/FALSE needed
the error message here is *confusing*. the error is raised because the
author of the code made a mistake and apparently hasn't carefully
examined and tested his product; the code goes:
duplicated.data.frame
# function (x, incomparables = FALSE, fromLast = FALSE, ...)
# {
# if (!is.logical(incomparables) || incomparables)
# .NotYetUsed("incomparables != FALSE")
# duplicated(do.call("paste", c(x, sep = "\r")), fromLast = fromLast)
# }
# <environment: namespace:base>
clearly, the intention here is to raise an error with a (still hardly
clear) message as in:
.NotYetUsed("incomparables != FALSE")
# Error: argument 'incomparables != FALSE' is not used (yet)
but instead, if(NA) is evaluated (because '!is.logical(NA) || NA'
evaluates, *obviously*, to NA) and hence the uninformative error message.
take home point: rtfm, *but* don't believe it.
vQ
-------------------------------------------------------------------------------
Wacek Kusnierczyk, MD PhD
Email: waku at idi.ntnu.no
Phone: +47 73591875, +47 72574609
Department of Computer and Information Science (IDI)
Faculty of Information Technology, Mathematics and Electrical Engineering (IME)
Norwegian University of Science and Technology (NTNU)
Sem Saelands vei 7, 7491 Trondheim, Norway
Room itv303
Bioinformatics & Gene Regulation Group
Department of Cancer Research and Molecular Medicine (IKM)
Faculty of Medicine (DMF)
Norwegian University of Science and Technology (NTNU)
Laboratory Center, Erling Skjalgsons gt. 1, 7030 Trondheim, Norway
Room 231.05.060
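The `if (NA)` failure described in the message above can be reproduced in isolation, without calling duplicated at all. The sketch below only mirrors the condition quoted from the source listing; the variable names `cond`, `res` and `guard` are made up here, and the `isFALSE()` guard at the end is one possible way of writing such a check (it needs R 3.5.0 or later), not a statement about what base R does.

```r
incomparables <- NA

# NA is a logical value, so the left operand is FALSE and `||` returns the NA
cond <- !is.logical(incomparables) || incomparables
cond

# if() cannot branch on NA, so it signals an error instead of choosing a branch
res <- tryCatch(
  if (cond) "branch taken" else "branch skipped",
  error = function(e) "error"
)
res

# isFALSE() is TRUE only for a single non-missing FALSE,
# so this guard is always either TRUE or FALSE, never NA
guard <- !isFALSE(incomparables)
guard
```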
i don't have any solution significantly better than what you have already been given.
i now seem to have one:
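# dummy data
data = data.frame(x=sample(1:2, 5, replace=TRUE),
                  y=sample(1:2, 5, replace=TRUE))
# add a class column; identical rows have the same class id
data$class = local({
    # one string key per row
    rows = do.call('paste', c(data, sep='\r'))
    # number the runs of equal keys in sorted order, then map them back to
    # the original row order (tied ranks may be fractional; used as indices
    # they are truncated and still fall inside the right run)
    with(
        rle(sort(rows)),
        rep(1:length(values), lengths)[rank(rows)] ) })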
data
# x y class
# 1 2 2 3
# 2 2 1 2
# 3 2 1 2
# 4 1 2 1
# 5 2 2 3
this approach seems to be roughly comparable to michael's, depending on
the shape (and size?) of the input:
# dummy data frame, just integers
n = 100; m = 100
data = as.data.frame(
    matrix(nrow=n, ncol=m,
        sample(n, m*n, replace=TRUE)))
# do a simple benchmarking
library(rbenchmark)
benchmark(
    replications=100,
    order='elapsed',
    columns=c('test', 'elapsed'),
    waku=local({
        rows = do.call('paste', c(data, sep='\r'))
        data$class = with(
            rle(sort(rows)),
            rep(1:length(values), lengths)[rank(rows)] ) }),
    mide=local({
        unique = unique(data)
        data = merge(data, cbind(unique, class=1:nrow(unique))) }))
# test elapsed
# 1 waku 0.503
# 2 mide 3.269
and for m = 10 and n = 1000 i get:
# test elapsed
# 1 waku 0.571
# 2 mide 15.836
while for m = 1000 and n = 10 i get:
# test elapsed
# 1 waku 1.110
# 2 mide 2.461
the type of the content should not have any impact on the ratio (pure
guess, no testing done).
whether my approach is more intuitive is arguable. note that, unlike in
michael's solution, the final result (the data frame with a class column
added) is in the original order. (and sorting would add a performance
penalty in the other case.)
my previous remarks about the treatment of NAs still apply; the
do.call('paste', ... is taken from duplicated.data.frame.
regards,
vQ
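To make the NA caveat concrete for the paste-based keys used in the message above, here is a small sketch on a made-up data frame `df` (not from the thread). Because paste() writes a missing value as the text NA, a row holding a real NA and a row holding the literal string "NA" end up with the same key and therefore the same class id.

```r
# made-up data: row 1 holds the string "NA", row 2 a real missing value
df <- data.frame(a = c("NA", NA, "b"), b = c(1, 1, 1),
                 stringsAsFactors = FALSE)

# one string key per row, as in the class-id code
keys <- do.call(paste, c(df, sep = "\r"))

# rows 1 and 2 receive the same class id although they differ
class_id <- match(keys, unique(keys))
class_id

# a simple guard: check for missing values before trusting the keys
has_missing <- any(is.na(df))
has_missing
```

Whether this matters depends on the data; if NAs should count as equal to each other and no column can contain the text "NA", the keys behave as intended.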
another approach (maybe a bit cleaner) seems to be:
data <- data.frame(x=sample(1:2, 5, replace=TRUE),
                   y=sample(1:2, 5, replace = TRUE))
vals <- do.call('paste', c(data, sep = '\r'))
data$class <- match(vals, unique(vals))
data
I have tried benchmarking it.
Best,
Dimitris
Dimitris Rizopoulos Assistant Professor Department of Biostatistics Erasmus University Medical Center Address: PO Box 2040, 3000 CA Rotterdam, the Netherlands Tel: +31/(0)10/7043478 Fax: +31/(0)10/7043014
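Coming back to the original question (which row is row 3 a duplicate of?), the same string keys can answer it directly. This is a minimal sketch on the data from the first message; `first_match`, `dup_of` and `groups` are names made up here, and the NA caveat about paste-based keys raised earlier in the thread applies to it as well.

```r
data <- data.frame(x = c(1, 3, 1), y = c(2, 4, 2), z = c(3, 4, 3))

# one string key per row
vals <- do.call(paste, c(data, sep = "\r"))

# match() returns the position of the first row with the same key
first_match <- match(vals, vals)

# for rows flagged by duplicated(), report the row they repeat; NA otherwise
dup_of <- ifelse(duplicated(vals), first_match, NA)
dup_of

# all row numbers grouped by class id
groups <- split(seq_along(vals), match(vals, unique(vals)))
groups
```

For this example `dup_of` is NA for rows 1 and 2 and 1 for row 3, which says that row 3 is a duplicate of row 1.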
Dimitris Rizopoulos wrote:
I have tried benchmarking it.
sorry, I wanted to write: I have *not* tried benchmarking it.
Best,
Dimitris
wow, cool! this seems unbeatable ;) i guess it can't be slower than any of the others.
vQ
Dimitris Rizopoulos wrote:
another approach (maybe a bit cleaner) seems to be:
data <- data.frame(x=sample(1:2, 5, replace=TRUE), y=sample(1:2, 5,
replace = TRUE))
vals <- do.call('paste', c(data, sep = '\r'))
data$class <- match(vals, unique(vals))
data
I have tried benchmarking it.
sorry, I wanted to write: I have *not* tried benchmarking it.
# dummy data frame, just integers
n = 100; m = 100
data = as.data.frame(
    matrix(nrow=n, ncol=m,
        sample(n, m*n, replace=TRUE)))
# do a simple benchmarking
library(rbenchmark)
benchmark(
    replications=100,
    order='elapsed',
    columns=c('test', 'elapsed'),
    waku=local({
        rows = do.call('paste', c(data, sep='\r'))
        data$class = with(
            rle(sort(rows)),
            rep(1:length(values), lengths)[rank(rows)] ) }),
    diri=local({
        values = do.call('paste', c(data, sep='\r'))
        data$class = match(values, unique(values)) }) )
# test elapsed
# 2 diri 0.43
# 1 waku 0.52
comparable for m=n=100 (and even better for n >> m), but way cleaner
code, and the class ids are now better sorted. that's collaborative
problem solving ;)
best,
vQ
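The benchmark above compares speed only. As a complement, here is a hedged sketch that checks the two labellings describe the same grouping, using base R alone. The seed and the data dimensions are arbitrary choices made here so that duplicates actually occur, and the names `keys`, `waku`, `diri` and `same_grouping` are illustrative.

```r
set.seed(1)  # arbitrary seed, only to make the dummy data reproducible
n <- 200; m <- 3
data <- as.data.frame(
  matrix(sample(3, n * m, replace = TRUE), nrow = n, ncol = m))

# one string key per row
keys <- do.call(paste, c(data, sep = "\r"))

# rle/rank labelling: class ids follow the sorted order of the keys
waku <- with(rle(sort(keys)),
             rep(seq_along(values), lengths)[rank(keys)])

# match/unique labelling: class ids follow the order of first appearance
diri <- match(keys, unique(keys))

# relabel waku by first appearance; equal vectors mean equal groupings
same_grouping <- identical(match(waku, unique(waku)), diri)
same_grouping
```

The ids themselves generally differ between the two labellings, which is why the comparison relabels one of them first.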