Complex sort problem

On Fri, May 18, 2012 at 06:37:03AM -0400, Axel Urbiz wrote:
The suggestion

  set.seed(12345)
  x <- sample(0:100, 10)
  x.order <- order(x)
  x.sorted <- x[x.order]

  sample.ind <- sample(seq_along(x), 5, replace = TRUE)  # sample of half the size, with replacement

  x.sample <- x.sorted[sample.ind]
  freq <- tabulate(sample.ind, nbins=length(x))
  x.sample.sorted <- rep(x.sorted, times=freq)

relies on the fact that rep(x.sorted, times=freq) preserves the order
of x.sorted. If x.sorted is a data frame, use instead

  sample.ind <- sample(1:nrow(x), 5, replace = TRUE)
  x.sample <- x.sorted[sample.ind, ]
  freq <- tabulate(sample.ind, nbins=nrow(x))
  x.sample.sorted <- x.sorted[rep(1:nrow(x.sorted), times=freq), ]
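As a rough check of the data frame variant, the following sketch builds a small example and compares the result with sorting the sample directly. The data frame x, its columns a and b, and the name x.direct are made up for illustration; the values of a are distinct, so there are no ties to worry about.

```r
set.seed(12345)
# Hypothetical data frame; a is the sorting variable and has distinct values
x <- data.frame(a = sample(0:100, 10), b = rnorm(10))
x.sorted <- x[order(x$a), ]

# Sample row positions of x.sorted, half the size, with replacement
sample.ind <- sample(seq_len(nrow(x)), 5, replace = TRUE)
x.sample <- x.sorted[sample.ind, ]

# Count how often each row was drawn and repeat the sorted rows accordingly
freq <- tabulate(sample.ind, nbins = nrow(x))
x.sample.sorted <- x.sorted[rep(seq_len(nrow(x.sorted)), times = freq), ]

# Sorting the sample directly gives the same values in the same order
x.direct <- x.sample[order(x.sample$a), ]
stopifnot(identical(x.sample.sorted$a, x.direct$a),
          identical(x.sample.sorted$b, x.direct$b))
```

The row names of the two results may be worth comparing separately, since R makes duplicated row names unique when rows are repeated.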

Several x.sorted data frames, each sorted by a different variable, can
be used in this way. Each one yields a pair x.sample and x.sample.sorted
containing the same sample, once unsorted and once sorted. However,
the samples differ between sorting variables, because sample.ind indexes
the rows of x.sorted and the same index selects a different case in each
sorted data frame.

If the same sample should be sortable by different variables, the
following may save CPU time. Compute the order of the original data
according to each relevant variable once, and store each as a rank vector
giving the position of every case in that ordering. Then, instead of
sorting a data frame representing a sample, determine the order from
the corresponding subset of the rank vector. This may be faster and
produces the same order.
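
One possible reading of this idea is sketched below. The data frame d, its columns a and b, and the names rank.a, rank.b, d.sample.by.a and d.sample.by.b are assumptions made for illustration. Here the sample indexes the rows of the original, unsorted data, so the same sample can be ordered by either variable.

```r
set.seed(12345)
n <- 10
# Hypothetical data: two variables the sample should be sortable by,
# each with distinct values
d <- data.frame(a = sample(0:100, n), b = sample(0:100, n))

# Rank vectors, computed once on the original data:
# rank.a[i] is the position of row i when d is sorted by a
rank.a <- integer(n); rank.a[order(d$a)] <- seq_len(n)
rank.b <- integer(n); rank.b[order(d$b)] <- seq_len(n)

# One sample of row indices into the original (unsorted) data
sample.ind <- sample(seq_len(n), 5, replace = TRUE)
d.sample <- d[sample.ind, ]

# Order the same sample by a or by b using only the subset of the
# corresponding rank vector, without sorting the sampled data frame
d.sample.by.a <- d[sample.ind[order(rank.a[sample.ind])], ]
d.sample.by.b <- d[sample.ind[order(rank.b[sample.ind])], ]

# Same result as sorting the sample directly by each variable
stopifnot(identical(d.sample.by.a$a, sort(d.sample$a)),
          identical(d.sample.by.b$b, sort(d.sample$b)))
```

Whether this is actually faster depends on the data; the possible gain is that only short integer vectors are ordered for each sample, whatever the type of the sorting variables.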

Petr Savicky.