Random sample from a data frame where ID column values don't match the values in an ID column in a second data frame

7 messages · inkhorn, David Winsemius

#
Hello,

Let's say I've drawn a random sample (sample1.df) from a large data frame
(main.df), and I want to create a second random sample (sample2.df) where
the values in its ID column *are not* in the equivalent ID column in the
first sample (sample1.df).  How would I go about doing this?

In other words:

The values in sample2.df$ID *are not found* in sample1.df$ID,  and both
samples are drawn from main.df.

Thanks in advance,
Matt Dubins

--
View this message in context: http://r.789695.n4.nabble.com/Random-sample-from-a-data-frame-where-ID-column-values-don-t-match-the-values-in-an-ID-column-in-a-se-tp4516448p4516448.html
Sent from the R help mailing list archive at Nabble.com.
#
On Mar 29, 2012, at 2:37 PM, inkhorn wrote:

            
?"%in%"

sample2.df <- main.df[ ! main.df[, "ID"] %in% sample1.df[, "ID"] , ]
David Winsemius, MD
West Hartford, CT
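
A minimal sketch of what the `%in%` line above does, step by step. The data and the seed are made up for illustration and are not from the thread. The result is every row of main.df whose ID is absent from sample1.df, that is, the full complement rather than a random subsample of it.

```r
# Hypothetical data, for illustration only
set.seed(1)
main.df <- data.frame(ID = c("1", "2", "3", "A1", "5", "6"),
                      fakedata = rnorm(6, 5, 0.5),
                      stringsAsFactors = FALSE)
sample1.df <- main.df[sample(nrow(main.df), 2), ]

# Logical vector, one element per row of main.df:
# TRUE where that row's ID also occurs in sample1.df
in.sample1 <- main.df[, "ID"] %in% sample1.df[, "ID"]

# Negate it and use it as a row index; the trailing comma
# (an empty column index) keeps all columns
complement.df <- main.df[!in.sample1, ]
```

Because `!` binds less tightly than `%in%`, writing `! a %in% b` is read as `!(a %in% b)`, so the unparenthesised form in the reply behaves the same way.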
#
When I use that exact syntax (with the ID variable names in quotes within the
square brackets after a comma) it just doesn't work.  Also, I'm looking for
a random sample, not all possible rows with ID values that don't match the
second data frame.

#
On Mar 29, 2012, at 4:22 PM, inkhorn wrote:

            
Which you didn't include as context. Please read the Posting Guide.
You are the one who offered "ID" as the column name.
Well, without a data example it certainly wasn't tested code. Please
read the Posting Guide.
Again, we do not have the context. My memory does not include that
as part of the request, but my memory is fallible, which is why
we ask everyone to read the Posting Guide.
Nabble is NOT the home of the R help mailing list and it is not its archive.

https://stat.ethz.ch/pipermail/r-help/
AND .....

  
    
#
Okay, here's some sample code:

ID = c(1,2,3,"A1",5,6,"A2",8,9,"A3")
fakedata = rnorm(10, 5, .5)
main.df = data.frame(ID,fakedata)

results for my data frame:
   ID fakedata
1   1 5.024332
2   2 4.752943
3   3 5.408618
4  A1 5.362838
5   5 5.158660
6   6 4.658235
7  A2 5.389601
8   8 4.998249
9   9 5.248517
10 A3 4.159490

sample1.df = main.df[sample(nrow(main.df), 4), ]
  ID fakedata
5  5 5.158660
9  9 5.248517
4 A1 5.362838
8  8 4.998249

Here's what happens when I put a comma before the variable ID:
Error in `[.data.frame`(main.df, !main.df[, "ID"] %in% sample1.df[, "ID"]) : 
  undefined columns selected

Here's what happens when I exclude the comma:

sample2.df = main.df[sample(nrow(main.df[! main.df["ID"] %in%
sample1.df["ID"]]), 5),]
   ID fakedata
8   8 4.998249
1   1 5.024332
3   3 5.408618
5   5 5.158660
10 A3 4.159490

As you can see, one way I get nothing other than an error, the other way I
get a sample that doesn't exclude rows that were already included in the 1st
sample.  

Thanks,
Matt Dubins
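
One possible reading of why both attempts above misbehave, sketched on made-up data of the same shape (the seed and values are assumptions). The error message shown has no trailing comma inside the outer brackets, so the logical vector is used to select columns. In the second attempt, `main.df["ID"]` is a one-column data frame rather than a vector, so `%in%` compares whole columns instead of individual IDs, and the outer index is applied to main.df rather than to the filtered rows.

```r
# Hypothetical data mirroring the shape of the example in the thread
set.seed(42)
main.df <- data.frame(ID = c(1, 2, 3, "A1", 5, 6, "A2", 8, 9, "A3"),
                      fakedata = rnorm(10, 5, 0.5),
                      stringsAsFactors = FALSE)
sample1.df <- main.df[sample(nrow(main.df), 4), ]

# Attempt 1: with no trailing comma, the 10-element logical vector is
# treated as a column selector on a 2-column data frame, which fails.
attempt1 <- tryCatch(
  main.df[!main.df[, "ID"] %in% sample1.df[, "ID"]],
  error = function(e) conditionMessage(e)
)

# Attempt 2: single-bracket indexing keeps a one-column data frame (a list),
# so %in% yields one value for the whole column, not one per row.
per.column <- main.df["ID"] %in% sample1.df["ID"]
length(per.column)   # 1, not nrow(main.df)

# Extracting the column as a vector gives the intended per-row comparison.
per.row <- main.df[["ID"]] %in% sample1.df[["ID"]]
length(per.row)      # 10
sum(per.row)         # 4 rows are shared with sample1.df
```

With a length-one result, `nrow(main.df[!per.column])` is still 10, so `sample()` draws from all ten row numbers of main.df and nothing is excluded.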

#
Okay thanks to your help I figured it out and stuck the code in a function:

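df.sample.exIDs = function(main.df, sample1.df, n, ID1.name, ID2.name) {
  # Rows of main.df whose ID (column ID1.name) does not occur in
  # sample1.df (column ID2.name)
  main.ID1.notin.ID2 = main.df[!main.df[, ID1.name] %in% sample1.df[, ID2.name], ]
  # Random sample of n of those remaining rows, without replacement
  sample2.df = main.ID1.notin.ID2[sample(nrow(main.ID1.notin.ID2), size = n), ]
  return(sample2.df)
}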

Cheers,
Matt Dubins
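
A short usage sketch for the function above, reproduced here so the example runs on its own. The data, the seed, and the choice of n = 3 are assumptions made for illustration.

```r
# The helper from the message above, repeated so this block is self-contained
df.sample.exIDs <- function(main.df, sample1.df, n, ID1.name, ID2.name) {
  main.ID1.notin.ID2 <- main.df[!main.df[, ID1.name] %in% sample1.df[, ID2.name], ]
  main.ID1.notin.ID2[sample(nrow(main.ID1.notin.ID2), size = n), ]
}

# Hypothetical data
set.seed(123)
main.df <- data.frame(ID = c(1, 2, 3, "A1", 5, 6, "A2", 8, 9, "A3"),
                      fakedata = rnorm(10, 5, 0.5),
                      stringsAsFactors = FALSE)
sample1.df <- main.df[sample(nrow(main.df), 4), ]

sample2.df <- df.sample.exIDs(main.df, sample1.df, n = 3,
                              ID1.name = "ID", ID2.name = "ID")

# The two samples share no IDs
length(intersect(sample1.df$ID, sample2.df$ID))   # 0

# Asking for more rows than remain (6 here) makes sample() signal an error
too.many <- try(df.sample.exIDs(main.df, sample1.df, n = 7, "ID", "ID"),
                silent = TRUE)
```

Two things a caller may want to keep in mind: `sample()` refuses a size larger than the number of remaining rows, and if main.df had only one column, the row subsetting would need `drop = FALSE` to keep returning a data frame.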

#
On Mar 30, 2012, at 8:17 AM, inkhorn wrote:

            
That was not the code I offered, which had no error:

 > sample2.df <- main.df[ ! main.df[, "ID"] %in% sample1.df[, "ID"] , ]
 > sample2.df
   ID fakedata
2  2 5.225752
4 A1 4.788752
5  5 3.973376
6  6 5.565669
8  8 5.369974
9  9 5.954552

If you want to further sub-sample from that complement which I offered
(and that _was_ a random sample from the main dataset, albeit not the
particular sample you wanted), then it is available for further
sampling.

 > sample2.df[ sample(nrow(sample2.df), 3), ]
   ID fakedata
2  2 5.225752
8  8 5.369974
6  6 5.565669
You cannot do both steps in one line using that exact strategy. But
you can "chain" uses of "[".  You could, for instance, have constructed
indexes (indices seems to be disappearing from the English language):

idx <- sample(nrow(main.df), 4)
subset1 <- main.df[ idx, ]
subset2 <- main.df[-idx, ][sample(nrow(main.df)-nrow(subset1), 3), ]

 > subset2
   ID fakedata
6  6 5.565669
5  5 3.973376
2  2 5.225752
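
A variation on the index-based idea, not taken from the thread: draw one set of row positions without replacement and split it into two parts. The data, the seed, and the sizes n1 and n2 are assumptions for illustration. The two subsets are disjoint by row, which matches "no shared IDs" only when the ID column has no duplicates.

```r
# Hypothetical data
set.seed(2012)
main.df <- data.frame(ID = c(1, 2, 3, "A1", 5, 6, "A2", 8, 9, "A3"),
                      fakedata = rnorm(10, 5, 0.5),
                      stringsAsFactors = FALSE)

n1 <- 4   # size of the first sample
n2 <- 3   # size of the second sample

# One draw of n1 + n2 distinct row positions
idx <- sample(nrow(main.df), n1 + n2)

# The first n1 positions form one sample, the next n2 the other
subset1 <- main.df[idx[seq_len(n1)], ]
subset2 <- main.df[idx[n1 + seq_len(n2)], ]
```

This only applies when both samples are drawn in the same session; if the first sample already exists, filtering on the ID column as earlier in the thread is the approach to use.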