Random selection from a subsample

2 messages · Tom Wilding, David Winsemius

#
Dear Mailing List

I have a data set (data4) consisting of a number of factors and a response variable.  I wish to randomly sample rows within each combination of two of those factors (GIS_station and Distance_code2) and return a new data frame that keeps the original structure (i.e. all the columns) but contains only the randomly selected rows.  The number of rows in each combination of GIS_station and Distance_code2 varies (widely), and some combinations are absent.

This is getting there:
with (data4,{
sub_sample10=by(data4,list(GIS_station,Distance_code2), function(x) {sample(1:nrow(x),10,replace=T)})
})

...but it just generates two random numbers from the range 1:nrow(x).  It doesn't return the selected rows, which is what I want.

I'm sure this could be done in an elegant manner, using a subscript, e.g.
 
sub_sample10 = data4 [sample (1:nrow (data4), size=10), ] 

only somehow combining it with the 'by' statement (e.g. by (data4, list (GIS_station, Distance_code2).......)) but I cannot get this to work.  

Any guidance on this much appreciated.

Thank you.
#
On Dec 19, 2010, at 5:31 AM, Tom Wilding wrote:

            
Only 2? Your argument to sample is 10.
And those row numbers would not refer to positions in the original
data frame either; they would be positions within each subset. You have
not yet done a very good job of specifying what sampling strategy is
needed. At the moment you seem to be working toward a strategy that
would potentially be very uneven in terms of the probabilities that
members of different combinations would get into the sample, since the
number being chosen is fixed and the number to be chosen from "varies
widely". Is that really what you want?
(You also have not provided a reproducible data example. Next time
bring data.)
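
To put numbers on that unevenness, here is a minimal sketch. The group sizes are hypothetical (no data came with the question), and it assumes sampling without replacement, whereas the original attempt used replace=T:

```r
# Hypothetical group sizes, for illustration only (not from data4)
group_sizes <- c(A = 10, B = 50, C = 400)
n_per_group <- 10

# Without replacement, each row in a group of size N is chosen with
# probability n/N (capped at 1 when the group is no larger than n)
incl_prob <- pmin(n_per_group / group_sizes, 1)
round(incl_prob, 3)
```

Under these assumed sizes, every row of the smallest group is selected, while a row in the largest group has only a 1 in 40 chance.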

This works to sample 3 from each of the distinct categories in
the warpbreaks data object:

by(warpbreaks, list(warpbreaks$wool, warpbreaks$tension),
   FUN=function(x) x[sample(1:nrow(x), 3), ] )
# returns a list with 6 members, each of which is a three-row data frame

And this would stick them back together in one data frame:

  do.call(rbind, by(warpbreaks, list(warpbreaks$wool, warpbreaks$tension),
                    FUN=function(x) x[sample(1:nrow(x), 3), ] ) )
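
The warpbreaks cells all have 9 rows, so asking for 3 always succeeds. For data like that in the question, where group sizes vary and some combinations are absent, sample() without replacement fails on any group smaller than the requested size. One possible adaptation is sketched below. The helper name sample_rows_by_group and the toy data are hypothetical, and capping the sample at the group size is an assumption about the sampling strategy, not something stated in the thread:

```r
# Hypothetical helper (not from the thread): sample up to n rows from each
# combination of the grouping columns and return a single data frame.
sample_rows_by_group <- function(df, group_cols, n) {
  # drop = TRUE discards combinations that have no rows
  pieces <- split(df, df[group_cols], drop = TRUE)
  sampled <- lapply(pieces, function(x) {
    k <- min(n, nrow(x))  # assumption: take the whole group if it is small
    x[sample(seq_len(nrow(x)), k), , drop = FALSE]
  })
  do.call(rbind, sampled)
}

set.seed(1)  # only to make the illustration repeatable

# Toy stand-in for data4: column names follow the question, values are made up
toy <- data.frame(
  GIS_station    = c("s1", "s1", "s1", "s1", "s2", "s2", "s3"),
  Distance_code2 = c("near", "near", "near", "far", "near", "near", "far"),
  response       = c(1.2, 3.4, 2.2, 5.1, 0.7, 1.9, 4.4)
)

res <- sample_rows_by_group(toy, c("GIS_station", "Distance_code2"), n = 2)
res
```

seq_len(nrow(x)) is used instead of 1:nrow(x) so that the index vector stays correct for very small groups. If a fixed number per group is required regardless of group size, replace = TRUE could be passed to sample() instead of capping, at the cost of duplicated rows.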