Hello,

Let's say I've drawn a random sample (sample1.df) from a large data frame (main.df), and I want to create a second random sample (sample2.df) where the values in its ID column *are not* in the equivalent ID column in the first sample (sample1.df). How would I go about doing this?

In other words: The values in sample2.df$ID *are not found* in sample1.df$ID, and both samples are drawn from main.df.

Thanks in advance,
Matt Dubins

--
View this message in context: http://r.789695.n4.nabble.com/Random-sample-from-a-data-frame-where-ID-column-values-don-t-match-the-values-in-an-ID-column-in-a-se-tp4516448p4516448.html
Sent from the R help mailing list archive at Nabble.com.
Random sample from a data frame where ID column values don't match the values in an ID column in a second data frame
7 messages · inkhorn, David Winsemius
On Mar 29, 2012, at 2:37 PM, inkhorn wrote:
Hello, Let's say I've drawn a random sample (sample1.df) from a large data frame (main.df), and I want to create a second random sample (sample2.df) where the values in its ID column *are not* in the equivalent ID column in the first sample (sample1.df). How would I go about doing this? In other words: The values in sample2.df$ID *are not found* in sample1.df$ID, and both samples are drawn from main.df.
?"%in%"
sample2.df <- main.df[ ! main.df[, "ID"] %in% sample1.df[, "ID"] , ]
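As a small illustration of the suggestion above (a sketch on made-up data: the toy main.df, the fixed rows used for sample1.df, and the variable names keep and complement.df are assumptions for illustration, not part of the original posts), %in% returns one TRUE/FALSE per element of its left-hand side, so negating it gives a row mask for the IDs that are absent from the first sample:

```r
# Toy data (hypothetical); IDs are assumed to be unique within main.df.
main.df <- data.frame(ID = c("1", "2", "3", "A1", "5", "6"),
                      value = c(10, 20, 30, 40, 50, 60))
sample1.df <- main.df[c(2, 4), ]   # stand-in for a first sample

# One logical value per row of main.df: TRUE where the ID is NOT in sample1.df.
keep <- !(main.df[, "ID"] %in% sample1.df[, "ID"])

# The trailing comma makes this a row selection that keeps all columns.
complement.df <- main.df[keep, ]
print(complement.df)
```

Note that this yields every row whose ID is not in sample1.df; drawing a random subset of those rows would be a separate step.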
Thanks in advance, Matt Dubins
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD West Hartford, CT
When I use that exact syntax (with the ID variable names in quotes within the square brackets after a comma) it just doesn't work. Also, I'm looking for a random sample, not all possible rows with ID values that don't match the second data frame.
On Mar 29, 2012, at 4:22 PM, inkhorn wrote:
When I use that exact syntax
Which you didn't include as context. Please read the Posting Guide.
(with the ID variable names
You are the one who offered "ID" as the column name.
in quotes within the square brackets after a comma) it just doesn't work.
Well, without a data example it certainly wasn't tested code. Please read the Posting Guide.
Also, I'm looking for a random sample, not all possible rows with ID values that don't match the second data frame.
Again, we do not have the context. My memory does not include that as part of the request, but my memory is fallible ... which is why we ask everyone to read the Posting Guide.
-- View this message in context: http://r.789695.n4.nabble.com/Random-sample-from-a-data-frame-where-ID-column-values-don-t-match-the-values-in-an-ID-column-in-a-se-tp4516448p4516866.html Sent from the R help mailing list archive at Nabble.com.
Nabble is NOT the home of the R help mailing list, and it is not its archive. https://stat.ethz.ch/pipermail/r-help/
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
AND .....
and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD West Hartford, CT
Okay, here's some sample code:

ID = c(1,2,3,"A1",5,6,"A2",8,9,"A3")
fakedata = rnorm(10, 5, .5)
main.df = data.frame(ID,fakedata)

results for my data frame:

main.df
   ID fakedata
1   1 5.024332
2   2 4.752943
3   3 5.408618
4  A1 5.362838
5   5 5.158660
6   6 4.658235
7  A2 5.389601
8   8 4.998249
9   9 5.248517
10 A3 4.159490

sample1.df = main.df[sample(nrow(main.df), 4), ]

sample1.df
  ID fakedata
5  5 5.158660
9  9 5.248517
4 A1 5.362838
8  8 4.998249

Here's what happens when I put a comma before the variable ID:

sample2.df = main.df[sample(nrow(main.df[! main.df[,"ID"] %in% sample1.df[,"ID"]]), 5),]
Error in `[.data.frame`(main.df, !main.df[, "ID"] %in% sample1.df[, "ID"]) : undefined columns selected

Here's what happens when I exclude the comma:

sample2.df = main.df[sample(nrow(main.df[! main.df["ID"] %in% sample1.df["ID"]]), 5),]

sample2.df
   ID fakedata
8   8 4.998249
1   1 5.024332
3   3 5.408618
5   5 5.158660
10 A3 4.159490

As you can see, one way I get nothing other than an error, the other way I get a sample that doesn't exclude rows that were already included in the 1st sample.

Thanks,
Matt Dubins
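The two outcomes reported above can probably be traced to how "[" and %in% treat data frames. The sketch below is one reading of the behaviour, not something stated in the thread: it rebuilds the example data (with sample1.df fixed to the rows shown above rather than drawn at random) and isolates each case. The names mask, err and res are hypothetical helpers.

```r
main.df <- data.frame(ID = c(1, 2, 3, "A1", 5, 6, "A2", 8, 9, "A3"),
                      fakedata = rnorm(10, 5, 0.5))
sample1.df <- main.df[c(5, 9, 4, 8), ]   # same rows as the sample shown above

# main.df[, "ID"] is a plain vector; main.df["ID"] is a one-column data frame.
is.data.frame(main.df[, "ID"])   # FALSE
is.data.frame(main.df["ID"])     # TRUE

# Case 1: the mask itself is right, but the inner "[" has no trailing comma,
# so the 10-element logical vector is used to select COLUMNS of a
# 2-column data frame, which fails.
mask <- !(main.df[, "ID"] %in% sample1.df[, "ID"])
err <- tryCatch(main.df[mask], error = function(e) conditionMessage(e))
print(err)

# Case 2: %in% applied to two data frames compares whole columns as single
# list elements rather than comparing the IDs one by one, so no row-wise
# mask is produced and no rows are filtered out.
res <- main.df["ID"] %in% sample1.df["ID"]
nrow(main.df[!res])   # 10: still every row of main.df
```

In both cases the fix is the same: build the mask from the vector form main.df[, "ID"] and use it in the row position, i.e. main.df[mask, ].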
Okay thanks to your help I figured it out and stuck the code in a function:
df.sample.exIDs = function(main.df, sample1.df, n, ID1.name, ID2.name) {
  main.ID1.notin.ID2 = main.df[!main.df[,ID1.name] %in% sample1.df[,ID2.name],]
  sample2.df = main.ID1.notin.ID2[sample(nrow(main.ID1.notin.ID2), size=n),]
  return(sample2.df)
}
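A possible usage sketch for the function above, on the example data from earlier in the thread. The set.seed() call and the choice of n = 5 are assumptions added here for reproducibility. The function is repeated so the snippet runs on its own.

```r
df.sample.exIDs <- function(main.df, sample1.df, n, ID1.name, ID2.name) {
  # Rows of main.df whose ID does not occur in sample1.df
  main.ID1.notin.ID2 <- main.df[!main.df[, ID1.name] %in% sample1.df[, ID2.name], ]
  # Random sample of n of those rows, without replacement
  sample2.df <- main.ID1.notin.ID2[sample(nrow(main.ID1.notin.ID2), size = n), ]
  return(sample2.df)
}

set.seed(1)   # only so the example is reproducible
main.df <- data.frame(ID = c(1, 2, 3, "A1", 5, 6, "A2", 8, 9, "A3"),
                      fakedata = rnorm(10, 5, 0.5))
sample1.df <- main.df[sample(nrow(main.df), 4), ]

sample2.df <- df.sample.exIDs(main.df, sample1.df, n = 5,
                              ID1.name = "ID", ID2.name = "ID")

# No ID should appear in both samples.
intersect(as.character(sample1.df$ID), as.character(sample2.df$ID))
```

With unique IDs, only 6 rows remain after excluding a sample of 4, so asking for n greater than 6 here would make sample() stop with an error; a check on n inside the function might be worth adding.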
Cheers,
Matt Dubins
On Mar 30, 2012, at 8:17 AM, inkhorn wrote:
Here's what happens when I put a comma before the variable ID:
sample2.df = main.df[sample(nrow(main.df[! main.df[,"ID"] %in% sample1.df[,"ID"]]), 5),]
Error in `[.data.frame`(main.df, !main.df[, "ID"] %in% sample1.df[, "ID"]) : undefined columns selected
That was not the code I offered, which had no error:

> sample2.df <- main.df[ ! main.df[, "ID"] %in% sample1.df[, "ID"] , ]
> sample2.df
  ID fakedata
2  2 5.225752
4 A1 4.788752
5  5 3.973376
6  6 5.565669
8  8 5.369974
9  9 5.954552

If you want to further sub-sample from that complement which I offered (and that _was_ a random sample from the main dataset albeit not the particular sample you wanted), then it is available for further sampling.

> sample2.df[ sample(nrow(sample2.df), 3), ]
  ID fakedata
2  2 5.225752
8  8 5.369974
6  6 5.565669
Here's what happens when I exclude the comma:
sample2.df = main.df[sample(nrow(main.df[! main.df["ID"] %in% sample1.df["ID"]]), 5),]
You cannot do both steps in one line using that exact strategy. But you can "chain" uses of "[". You could for instance have constructed indexes (indices seems to be disappearing from the English language):

idx <- sample(nrow(main.df), 4)
subset1 <- main.df[ idx, ]
subset2 <- main.df[-idx, ][sample(nrow(main.df)-nrow(subset1), 3), ]

> subset2
  ID fakedata
6  6 5.565669
5  5 3.973376
2  2 5.225752
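A variation on the index idea, sketched here as an assumption rather than something posted in the thread: draw the second set of row numbers directly from the rows not yet used, via setdiff(). The names idx1 and idx2 and the set.seed() call are hypothetical additions.

```r
set.seed(42)   # only so the example is reproducible
main.df <- data.frame(ID = c(1, 2, 3, "A1", 5, 6, "A2", 8, 9, "A3"),
                      fakedata = rnorm(10, 5, 0.5))

# Row numbers for the first sample
idx1 <- sample(nrow(main.df), 4)

# Row numbers still available, then a second draw from those only
remaining <- setdiff(seq_len(nrow(main.df)), idx1)
idx2 <- sample(remaining, 3)

subset1 <- main.df[idx1, ]
subset2 <- main.df[idx2, ]
```

Two caveats: disjoint row numbers imply disjoint IDs only when each ID occurs once in main.df, and sample(x, n) treats a single number x as 1:x, so this form needs care when only one row remains.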
David.

David Winsemius, MD West Hartford, CT