Skip to content

Duplicate rows when I combine two data.frames with merge!

4 messages · RKinzer, Sarah Goslee, Ista Zahn +1 more

#
Hello all,

First I have done extensive searches on this forum and others and nothing
seems to work.  So I decided to post thinking someone could point me to the
write post or give me some help.

I have drawn a 100 samples from a fictitious population (N=1000), and then
randomly selected 25% of the 100 samples.  I would like to now merge the
data.frame from the 100 samples with the data.frame for the 25 individuals
from the sample.  When I do this with the following code I get duplicate
rows, when I should have at most is 100.

x<-mapply(rnorm,1000,c(54,78,89),c(3.5,5.5,5.9))  #sets up 1000 random
numbers for age 3,4,5
x.3<-sample(x[,1],60)  #randomly selects 60 lengths from age 3
x.4<-sample(x[,2],740)
x.5<-sample(x[,3],200)
length<-c(x.3,x.4,x.5)  
length<-round(length,digits=0)  #rounds lengths to whole number
age3<-rep(3,60) 
age4<-rep(4,740)
age5<-rep(5,200)
age<-c(age3,age4,age5)  #combines ages into one vector
unique<-1:1000  #gives each fish a unique id
pop<-data.frame(unique,length,age) 
pop<-pop[sample(1:1000,size=1000,replace=FALSE),]  #randomized the order of
pop
c.one<-pop[sample(1:1000,size=100,replace=TRUE),] 
a.one.qtr<-c.one[sample(1:100,size=25,replace=TRUE),] 
merge<-merge(c.one,a.one.qtr,by="unique",all=TRUE)

What I would ultimately like to have is one row for all 100 in the sample
and three columns (unique, length, age).  And then some way to identify the
25 individual selected rows.

Thank you upfront for any help.  I have been stuck for days.

Ryan



--
View this message in context: http://r.789695.n4.nabble.com/Duplicate-rows-when-I-combine-two-data-frames-with-merge-tp4362685p4362685.html
Sent from the R help mailing list archive at Nabble.com.
#
Hi,

Why do you need to merge them? c.one contains what I think you want,
and then you want to randomly select 25 rows from that without replacement:

c.one <- cbind(c.one, a.qtr = sample(c(rep(TRUE, 25), rep(FALSE, 75))))
unique length age a.qtr
649    649     71   4  TRUE
200    200     79   4 FALSE
410    410     82   4  TRUE
620    620     73   4 FALSE
723    723     81   4 FALSE
855    855     96   5 FALSE

If what you mean by the subject line is that there are duplicate rows in the
merged data frame, of course there are, because there are duplicated rows
in c.one, because you selected sampling with replacement when selecting
rows from pop to make c.one.

Something to be VERY careful of: length, merge and unique are all base
functions, and shouldn't be used as variable names. After you named
something merge, what happens when you try to use merge()?

If I'm misunderstanding the question, then please try to explain more clearly
what you are looking for.

Sarah
On Mon, Feb 6, 2012 at 3:29 PM, RKinzer <ryank at nezperce.org> wrote:

  
    
  
#
Hi Ryan,

You're getting this result because five of the rows in a.one.qtr are
duplicates, as a results of sampling with replacement:

length(which(duplicated(a.one.qtr)))
[1] 5

The relevant section of the documentation (see ?merge) reads "The rows
in the two data frames that match on the specified columns are
extracted, and joined together. *If there is more than one match, all
possible matches contribute one row each*" (emphasis mine).

Best,
Ista
On Mon, Feb 6, 2012 at 3:29 PM, RKinzer <ryank at nezperce.org> wrote:
#
On Mon, Feb 06, 2012 at 12:29:53PM -0800, RKinzer wrote:
The function merge() here includes additional columns, which
contain in the rows from a.one.qtr copies of the columns length
and age. So, the same values appear twice in the row. I am
not sure, whether this is intended.

Another representation of the subsample a.one.qtr may be done
by adding a column to c.one, which specifies, how many times
was the row selected to a.one.qtr. For example as follows.

  a.one.qtr2 <- sample(1:100,size=25,replace=TRUE)
  c.one2 <- cbind(c.one, selected=tabulate(a.one.qtr2, nbins=100))

  # a random result may look like

  c.one2
  
        unique length age selected
  657      657     81   4        0
  488      488     78   4        1
  886      886     85   5        0
  448      448     82   4        0
  292      292     80   4        0
  431      431     78   4        0
  683      683     82   4        0
  32        32     56   3        2
  740      740     80   4        0
  519      519     81   4        1
  986      986     88   5        0
  437      437     84   4        0
  247      247     88   4        0
  122      122     73   4        0
  ...

The sum of the column "selected" is 25.

Hope this helps.

Petr Savicky.