
uniq -c

You said you wanted the equivalent of the Unix 'uniq -c', but also said
that the xtabs() results were roughly right and that rle() might be what
you want.  rle() is the equivalent of 'uniq -c': both output the
lengths of runs of identical elements.  If the data are sorted, both
are equivalent to using table() or xtabs().
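
A small sketch of that equivalence, using a made-up character vector (the
data and variable names here are illustrative only, not from your problem):

```r
x <- c("a", "a", "b", "c", "c", "c", "a")

# rle() counts consecutive runs, like 'uniq -c' on unsorted input:
# the trailing "a" is reported as its own run
r <- rle(x)
runs <- data.frame(Count = r$lengths, Value = r$values)

# After sorting, every distinct value forms exactly one run, so the
# run lengths agree with the counts from table()
rs <- rle(sort(x))
tab <- table(x)
sortedAgrees <- all(rs$lengths == as.vector(tab)) &&
    identical(rs$values, names(tab))
```

On unsorted input the two differ: rle() reports "a" twice (runs of 2 and 1),
while table() reports a single count of 3.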

Since you have sorted data, try the following:

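# TRUE at each position that starts a new run of identical values
isFirstInRun <- function(x) UseMethod("isFirstInRun")
isFirstInRun.default <- function(x) {
    if (length(x) == 0L) return(logical(0))  # no elements, no runs
    c(TRUE, x[-1] != x[-length(x)])
}
# For a data frame, a row starts a new run if any column changes there
isFirstInRun.data.frame <- function(x) {
    stopifnot(ncol(x) > 0)
    retval <- isFirstInRun(x[[1]])
    for (column in x[-1]) {  # first column is already accounted for
        retval <- retval | isFirstInRun(column)
    }
    retval
}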

i <- which(isFirstInRun(yourDataFrame))

Then I think
   data.frame(Count=diff(c(i, 1L+nrow(yourDataFrame))), yourDataFrame[i,])
gives you what you want.   E.g.,
  > yourDataFrame <- data.frame(x1=c(1,1,2,2,1), x2=c(11,11,11,12,11))
  > i <- which(isFirstInRun(yourDataFrame))
  > i
  [1] 1 3 4 5
  > data.frame(Count=diff(c(i, 1L+nrow(yourDataFrame))), yourDataFrame[i,])
    Count x1 x2
  1     2  1 11
  3     1  2 11
  4     1  2 12
  5     1  1 11
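
One way to cross-check those counts is to collapse each row to a single
string and hand it to rle().  This is only a sketch for verification: the
'key' variable and the "\r" separator are my own choices, and pasting
converts values to character, so it can conflate distinct values whose
character forms coincide.

```r
yourDataFrame <- data.frame(x1 = c(1, 1, 2, 2, 1), x2 = c(11, 11, 11, 12, 11))

# One string per row; consecutive identical rows give identical keys
key <- paste(yourDataFrame$x1, yourDataFrame$x2, sep = "\r")
runs <- rle(key)

# Run lengths should match the Count column above
counts <- runs$lengths

# Row index at which each run starts
starts <- cumsum(c(1L, head(runs$lengths, -1)))
```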

It should be pretty quick.  If you have missing values in your data frame,
you will have to make some decisions about whether they should be
considered equal to each other or not and modify isFirstInRun.default
accordingly.
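
As written, isFirstInRun.default returns NA wherever a comparison involves
a missing value, and which() silently drops those positions.  If you decide
that NA should equal NA (and differ from every non-missing value), one
possible replacement is sketched below.  The name isFirstInRunNA is
hypothetical, and this is just one of the choices you could make.

```r
# Hypothetical variant: NA matches NA, NA differs from any non-NA value
isFirstInRunNA <- function(x) {
    n <- length(x)
    if (n == 0L) return(logical(0))
    prev <- x[-n]   # each element's predecessor
    curr <- x[-1]
    # Exactly one of the pair is missing: treat as a change
    oneNA <- xor(is.na(prev), is.na(curr))
    # Both present: ordinary comparison (the NA from != is masked out)
    bothPresent <- !is.na(prev) & !is.na(curr)
    c(TRUE, oneNA | (bothPresent & prev != curr))
}

x <- c(1, NA, NA, 2, 2, NA)
firstNA <- isFirstInRunNA(x)
```

Here the two adjacent NAs are treated as one run, so which(firstNA) gives
positions 1, 2, 4 and 6.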

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com