proposed change to 'sample'

7 messages · Patrick Burns, Hadley Wickham, William Dunlap +2 more

#
There is a weakness in the 'sample'
function that is highlighted in the
help file.  The 'x' argument can be
either the vector from which to sample,
or the maximum value of the sequence
from which to sample.

This can be ambiguous if the length of
'x' is one.
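
To make the ambiguity concrete, here is a small illustration. The vector values are made up for this example; only the behaviour of 'sample' itself is taken from the help file.

```r
x <- c(3, 10, 12)

# Two elements pass the filter, so sample() permutes them.
two <- sample(x[x > 9])     # some ordering of 10 and 12

# Only one element passes the filter, so sample() treats the
# value 12 as a maximum and permutes 1:12 instead.
one <- sample(x[x > 11])

length(two)   # 2
length(one)   # 12, not 1
```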

I propose adding an argument that allows
the user (programmer) to avoid that
ambiguity:

function (x, size, replace = FALSE, prob = NULL,
     max = length(x) == 1L && is.numeric(x) && x >= 1)
{
     if (max) {
         if (missing(size))
             size <- x
         .Internal(sample(x, size, replace, prob))
     }
     else {
         if (missing(size))
             size <- length(x)
         x[.Internal(sample(length(x), size, replace, prob))]
     }
}
<environment: namespace:base>


This just takes the condition of the first
'if' to be the default value of the new 'max'
argument.

So in the "surprise" section of the examples
in the 'sample' help file

sample(x[x > 9])

and

sample(x[x > 9], max=FALSE)

have different behaviours.
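
A rough way to try the idea outside of base is sketched below. It is only an approximation of the proposal: 'sample2' is a made-up name, and sample.int() stands in for the .Internal() call, which is not usable from user code.

```r
# Hypothetical stand-in for the proposed function; 'sample2' is an
# illustrative name, and sample.int() replaces the .Internal() call.
sample2 <- function(x, size, replace = FALSE, prob = NULL,
                    max = length(x) == 1L && is.numeric(x) && x >= 1) {
  if (max) {
    # 'x' is read as the upper end of 1:x
    if (missing(size)) size <- x
    sample.int(x, size, replace, prob)
  } else {
    # 'x' is read as the population itself, whatever its length
    if (missing(size)) size <- length(x)
    x[sample.int(length(x), size, replace, prob)]
  }
}

x <- 1:10
sample2(x[x > 9])               # a permutation of 1:10
sample2(x[x > 9], max = FALSE)  # always 10
```

With the default, 'max' reproduces the current behaviour, so only callers who pass max=FALSE see a change.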

By the way, I'm certainly not convinced that
'max' is the best name for the argument.
#
S+'s sample() has an argument 'n' to achieve
the same result.  It has been there since at
least 2005 (S+ 7.0.6).  sample(n=n) means to
return a sample from seq_len(n), where n must
be a scalar nonnegative integer.  sample(x=x)
retains its old ambiguous meaning.
  sample(x, size = n, replace = F, prob = NULL, n = NULL, ...)

S+ also has an rsample function where n (with
the same meaning) is the only way to specify the
population.  It also has an order=TRUE/FALSE argument
where order=TRUE means to randomly order the output.
order=FALSE means that the ordering of the output is
unspecified, but it allows the person writing rsample
methods to use the quickest way to get a random sample
(for big data it can be fastest to return the sample
from 1:n in increasing order). 
  rsample(n, size = n, replace = F, prob = NULL,
        bigdata = F, minimal = NULL, ..., order = T)
I like the idea of separating the concepts of sampling
and permuting data.  Many statistics are invariant to
ordering of the data and it can be a waste of time
to randomly order a sample to feed to such functions.
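
As a small illustration of that point in R (this does not reproduce rsample or its order= argument; the data and seed are arbitrary), the same sample gives the same median whether or not its indices are left in random order:

```r
set.seed(1)          # arbitrary seed, only for reproducibility
x <- rnorm(1000)     # made-up data

idx <- sample.int(length(x), 100)   # a random sample, in random order
sorted_idx <- sort(idx)             # the same sample, in increasing order

# An order-invariant statistic gives the same answer either way,
# so the random permutation bought nothing here.
all.equal(median(x[idx]), median(x[sorted_idx]))
```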

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
William Dunlap wrote:
....
Hmm, that doesn't really solve the issue, does it? I.e., you still cannot
conveniently sample from a vector that is possibly of length 1.

I would be more inclined to make sampling from a vector the normal case,
and default x to say 1:max(n, size), forcing users to say sample(n=5) if
sampling from x=1:5 is desired. This could be a manageable change; the
deprecation sequence is a bit painful to think through, though.
#
Don't we already have sample.int for that case?

Hadley
#
I think that the risk of breaking old code was why we
allowed the user to use an unambiguous sample(n=n),
but didn't change how sample(x=scalar) worked.
Internally, we had long discouraged using sample(x=vector)
because of the ambiguity problem, preferring
x[sample(length(x),...)].
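
In R terms, that idiom can be wrapped in a one-line helper. This is a sketch only: 'resample' is an illustrative name rather than a base function, and it uses sample.int() for the index draw.

```r
# 'resample' is an illustrative helper name, not part of base R.
# It always treats 'x' as the population, whatever its length.
resample <- function(x, ...) x[sample.int(length(x), ...)]

resample(c(10, 12))                     # some ordering of 10 and 12
resample(12)                            # always 12
resample(12, size = 3, replace = TRUE)  # 12 12 12
```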

I notice that S+'s rsample() does not allow sampling
from a vector, only from seq_len(n).  I think that
is because it was felt that sampling rows from a data.frame
(or the bigdata equivalent, bdframe) was a more common
operation and the code was simpler/faster if rsample didn't
have to call out to possible subscripting methods.  Relaxing
the requirement that the output be a randomly permuted
sample mattered more when dealing with long
datasets.

In any case, I was just stating that if sample were
changed to allow disambiguation of its first argument,
using 'n' instead of 'max' would be compatible with S+.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
Hadley Wickham wrote:
For the 2nd case, yes, but I was aiming at getting sample(x) ==
x[sample.int(length(x))] also in the length 1 case, removing the
ambiguity. This would obviously break some code, but I'd expect not all
that much. However, it cannot be changed in one go; we'd need to go
through a sequence where we (e.g.)

1. warn about length(x)==1
2. say that length(x)==1 is deprecated
3. have length(x)==1 throw an error
4. wait....
5. give length(x)==1 a new meaning
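
Step 1 of that sequence could look roughly like the wrapper below. It is a sketch under assumptions: 'sample_step1' is a made-up name, the message wording is invented, and a real change would be made inside 'sample' itself rather than in a wrapper.

```r
# Hypothetical wrapper illustrating step 1 only; 'sample_step1' and the
# warning text are made up for this sketch.
sample_step1 <- function(x, ...) {
  if (length(x) == 1L) {
    warning("sample(x) with length(x) == 1 is ambiguous; ",
            "use sample.int(n) or x[sample.int(length(x))]")
  }
  sample(x, ...)   # behaviour is unchanged at this stage
}

sample_step1(c(4, 5))   # no warning
sample_step1(5)         # warns, then still returns a permutation of 1:5
```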
#
On Mon, Jun 21, 2010 at 1:57 AM, Peter Dalgaard <pdalgd at gmail.com> wrote:
Please implement this sequence!

Kjetil