Dear Mailing List
I have a data set (data4) consisting of a number of factors and a response variable. I wish to randomly sample from a combination of two of those factors (GIS_station and Distance_code2) and return a new dataframe containing the original data structure (i.e. all the columns) but only containing the randomly selected rows. The number of rows in each combination of GIS_station and Distance_code2 varies (widely) and some combinations are absent.
This is getting there:
with (data4,{
sub_sample10=by(data4,list(GIS_station,Distance_code2), function(x) {sample(1:nrow(x),10,replace=T)})
})
....but just generates two random numbers from the range 1:nrow(x). It doesn't return the selected rows, which is what I want.
I'm sure this could be done in an elegant manner, using a subscript, e.g.
sub_sample10 = data4 [sample (1:nrow (data4), size=10), ]
only somehow combining it with the 'by' statement (e.g. by (data4, list (GIS_station, Distance_code2).......)) but I cannot get this to work.
Any guidance on this much appreciated.
Thank you.
Random selection from a subsample
2 messages · Tom Wilding, David Winsemius
On Dec 19, 2010, at 5:31 AM, Tom Wilding wrote:
> Dear Mailing List
> I have a data set (data4) consisting of a number of factors and a
> response variable. I wish to randomly sample from a combination of
> two of those factors (GIS_station and Distance_code2) and return a
> new dataframe containing the original data structure (i.e. all the
> columns) but only containing the randomly selected rows. The number
> of rows in each combination of GIS_station and Distance_code2 varies
> (widely) and some combinations are absent.
> This is getting there:
> with (data4,{
> sub_sample10=by(data4,list(GIS_station,Distance_code2), function(x)
> {sample(1:nrow(x),10,replace=T)})
> })
> ....but just generates two random numbers from the range 1:nrow(x).
Only 2? Your argument to sample is 10.
> It doesn't return the selected rows, which is what I want.
And those row numbers would not refer to positions in the original data either; they would index rows within the subset that by() passes to the function for each combination.

You have not yet done a very good job of specifying what sampling strategy is needed. At the moment you seem to be working toward a strategy that would potentially be very uneven in terms of the probabilities that members of different combinations would get into the sample, since the number being chosen is fixed and the number to be chosen from "varies widely". Is that really what you want?
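To make the concern about uneven probabilities concrete, here is a minimal sketch. The data frame `toy`, its column names, and the group sizes are invented for illustration and are not from your data. It computes the chance that any one row is drawn when a fixed 10 rows are taken without replacement from groups of different sizes, and then shows one possible alternative, drawing a fixed fraction of each group.

```r
# Hypothetical data: three groups of very different sizes (not the poster's data)
set.seed(1)
toy <- data.frame(
  station = rep(c("A", "B", "C"), times = c(10, 50, 400)),
  y       = rnorm(460)
)

n_fixed    <- 10
group_size <- table(toy$station)

# Chance that a given row is selected when n_fixed rows are drawn
# without replacement from its group
incl_prob <- pmin(1, n_fixed / as.numeric(group_size))
names(incl_prob) <- names(group_size)
incl_prob

# One alternative (an assumption, not a recommendation): sample a fixed
# fraction of each group, with at least one row per group
frac <- 0.1
prop_sample <- do.call(rbind, lapply(split(toy, toy$station), function(x) {
  k <- max(1, round(frac * nrow(x)))
  x[sample.int(nrow(x), k), , drop = FALSE]
}))
table(prop_sample$station)
```

With a fixed count, every row of the smallest group is certain to be picked while a row in the largest group has only a small chance. Whether that matters depends on what the subsample is for.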
> I'm sure this could be done in an elegant manner, using a
> subscript, e.g.
> sub_sample10 = data4 [sample (1:nrow (data4), size=10), ]
(You also have not provided a reproducible data example. Next time bring data.) This works to sample 3 from each of the distinct categories in the warpbreaks data object:

by(warpbreaks, list(warpbreaks$wool, warpbreaks$tension),
   FUN=function(x) x[sample(1:nrow(x), 3), ] )
# returns a list with 6 members, each of which is a three-row data frame

And this would stick them back together in one data frame:

do.call(rbind, by(warpbreaks, list(warpbreaks$wool, warpbreaks$tension),
   FUN=function(x) x[sample(1:nrow(x), 3), ] ) )
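Since I do not have your data, the following is only a sketch of how the same idea might carry over to your column names. The `data4` built here is a made-up stand-in (only the names GIS_station and Distance_code2 come from your post), arranged so that one combination is absent and another has fewer than 10 rows. It samples without replacement and caps the draw at the group size, which is an assumption about what you want.

```r
# Toy stand-in for data4: values are hypothetical, only the two factor
# names are taken from the post
set.seed(42)
data4 <- data.frame(
  GIS_station    = c(rep("S1", 30), rep("S2", 4)),
  Distance_code2 = c(rep(c("near", "far"), each = 15), rep("near", 4)),
  response       = rnorm(34)
)
# Here the S2/far combination is absent and S2/near has only 4 rows

n_per_group <- 10
picked <- by(data4, list(data4$GIS_station, data4$Distance_code2),
             function(x) {
               k <- min(n_per_group, nrow(x))   # never ask for more rows than exist
               x[sample.int(nrow(x), k), , drop = FALSE]
             })

# Absent combinations come back as NULL and are dropped when binding
sub_sample10 <- do.call(rbind, picked)
table(sub_sample10$GIS_station, sub_sample10$Distance_code2)
```

Using sample.int(nrow(x), k) rather than sample(1:nrow(x), k) is a matter of taste here. If you do want exactly 10 rows from every combination, including small ones, you would need replace=TRUE as in your original attempt.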
David.

> only somehow combining it with the 'by' statement (e.g. by (data4,
> list (GIS_station, Distance_code2).......)) but I cannot get this to
> work.
>
> Any guidance on this much appreciated.
>
> Thank you.

David Winsemius, MD
West Hartford, CT