Counting occurrences of variables in a dataframe

6 messages · Kai Mx, Tal Galili, David Winsemius +1 more

#
On Sat, Feb 11, 2012 at 07:17:54PM +0100, Kai Mx wrote:
Hi.

Is the first 2 in the new variable due to the fact that
the name is "ab" and "ab" at row 5 has an older date? If so,
then try the following:

  ind <- order(kdata$kdate)
  f <- function(x) seq.int(along.with=x)
  kdata$x <- ave(1:nrow(kdata), kdata$knames[ind], FUN=f)[order(ind)]

     knames      kdate x
  1      ab 2011-10-01 2
  2      aa 2011-11-02 2
  3      ac 2010-10-01 1
  4      ad 2010-03-15 1
  5      ab 2010-12-01 1
  6      ac 2011-01-05 2
  7      aa 2010-10-01 1
  8      ad 2011-05-04 2
  9      ae 2011-06-03 1
  10     af 2011-02-01 1

kdata$knames[ind] orders the names by increasing date.
ave(...)[order(ind)] reorders the output of ave() to the original order.
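
The reordering step can be checked on a small vector. This is only a sketch: the vector d below is made up for illustration and stands in for kdata$kdate. It shows that order(ind) is the inverse permutation of ind, so indexing by it undoes the sorting.

```r
# Toy values standing in for kdata$kdate (hypothetical, not from the thread)
d <- c(30, 10, 20)

ind <- order(d)          # 2 3 1: positions of the elements in increasing order
inv <- order(ind)        # 3 1 2: inverse permutation of ind

sorted   <- d[ind]       # 10 20 30
restored <- sorted[inv]  # 30 10 20, the original order again

stopifnot(identical(restored, d))
```

Anything computed on the sorted data, such as the within-name counter produced by ave(), can be mapped back to the original rows in the same way.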

Hope this helps.

Petr Savicky.
#
On Feb 11, 2012, at 1:17 PM, Kai Mx wrote:

>  ave(unclass(kdate), knames, FUN=order )
  [1] 2 2 1 1 1 2 1 2 1 1


That used the standalone vectors rather than the dataframe columns,
but you could also do this:

 > kdata$ord <- with(kdata, ave(unclass(kdate), knames, FUN=order ))
 > kdata
    knames      kdate ord
1      ab 2011-10-01   2
2      aa 2011-11-02   2
3      ac 2010-10-01   1
4      ad 2010-03-15   1
5      ab 2010-12-01   1
6      ac 2011-01-05   2
7      aa 2010-10-01   1
8      ad 2011-05-04   2
9      ae 2011-06-03   1
10     af 2011-02-01   1

David Winsemius, MD
West Hartford, CT
#
On Sat, Feb 11, 2012 at 04:05:25PM -0500, David Winsemius wrote:
Hi.

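This is a good solution if there are at most two occurrences
of each name. If there are more occurrences, then function "order"
should be replaced by "rank": order() returns the permutation that
sorts the dates, while the required index is the position of each
date in the sorted sequence, which is what rank() computes. The two
coincide only for groups of at most two elements. Replacing name
"aa" at row 2 by "ab", we get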

  knames <-c('ab', 'ab', 'ac', 'ad', 'ab', 'ac', 'aa', 'ad','ae', 'af')
  kdate <- as.Date( c('20111001', '20111102', '20101001', '20100315',
  '20101201', '20110105', '20101001', '20110504', '20110603', '20110201'),
  format="%Y%m%d")
  kdata <- data.frame (knames, kdate)

  kdata$ord <- with(kdata, ave(unclass(kdate), knames, FUN=order))
  kdata$rank <- with(kdata, ave(unclass(kdate), knames, FUN=rank))
  kdata

     knames      kdate ord rank
  1      ab 2011-10-01   3    2
  2      ab 2011-11-02   1    3
  3      ac 2010-10-01   1    1
  4      ad 2010-03-15   1    1
  5      ab 2010-12-01   2    1
  6      ac 2011-01-05   2    2
  7      aa 2010-10-01   1    1
  8      ad 2011-05-04   2    2
  9      ae 2011-06-03   1    1
  10     af 2011-02-01   1    1

The name "ab" occurs, by increasing date, in the order row 5, row 1,
row 2, so row 5 should get index 1, row 1 index 2 and row 2 index 3.
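
The difference between order() and rank() can be seen on the three "ab" dates alone. The following is a minimal sketch using only base R; the variable names ord, rnk, inv and y are chosen here for illustration.

```r
# Dates of the three "ab" rows (rows 1, 2 and 5) from the example above
x <- unclass(as.Date(c("2011-10-01", "2011-11-02", "2010-12-01")))

ord <- order(x)    # 3 1 2: which element is earliest, middle, latest
rnk <- rank(x)     # 2 3 1: chronological index of each element
inv <- order(ord)  # 2 3 1: inverse of the sorting permutation

# Without ties, rank() equals the inverse permutation of order()
stopifnot(all(rnk == inv))

# For two elements, order() and rank() give the same result
y <- c(5, 3)
stopifnot(all(order(y) == rank(y)))
```

Every permutation of two elements is its own inverse, which is why FUN=order happens to work when no name occurs more than twice.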

If some of the dates repeat within a name, then rank() by default
assigns the average index to the tied rows. In this case, the
following function f(), which breaks ties by order of appearance,
may be used

  knames <-c('ab', 'ab', 'ac', 'ad', 'ab', 'ac', 'aa', 'ad','ae', 'af')
  kdate <- as.Date( c('20111001', '20111001', '20101001', '20100315',
  '20101201', '20110105', '20101001', '20110504', '20110603', '20110201'),
  format="%Y%m%d")
  kdata <- data.frame (knames, kdate)

  kdata$rank <- with(kdata, ave(unclass(kdate), knames, FUN=rank))
  f <- function(x) rank(x, ties.method="first")
  kdata$f <- with(kdata, ave(unclass(kdate), knames, FUN=f))
  kdata
  
     knames      kdate rank f
  1      ab 2011-10-01  2.5 2
  2      ab 2011-10-01  2.5 3
  3      ac 2010-10-01  1.0 1
  4      ad 2010-03-15  1.0 1
  5      ab 2010-12-01  1.0 1
  6      ac 2011-01-05  2.0 2
  7      aa 2010-10-01  1.0 1
  8      ad 2011-05-04  2.0 2
  9      ae 2011-06-03  1.0 1
  10     af 2011-02-01  1.0 1
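
As a cross-check, the order()-based solution from the first reply in this thread should give the same index on the data with tied dates, since order() keeps tied elements in their original order. The sketch below rebuilds the same kdata; the names seq_idx and rank_idx are introduced here for illustration only.

```r
# Same data as above, with the tied dates at rows 1 and 2
knames <- c("ab", "ab", "ac", "ad", "ab", "ac", "aa", "ad", "ae", "af")
kdate <- as.Date(c("20111001", "20111001", "20101001", "20100315",
                   "20101201", "20110105", "20101001", "20110504",
                   "20110603", "20110201"), format = "%Y%m%d")
kdata <- data.frame(knames, kdate)

# Approach 1: sort by date, count within each name, restore row order
ind <- order(kdata$kdate)
seq_idx <- ave(seq_len(nrow(kdata)), kdata$knames[ind],
               FUN = seq_along)[order(ind)]

# Approach 2: rank the dates within each name, ties broken by appearance
rank_idx <- with(kdata, ave(unclass(kdate), knames,
                            FUN = function(x) rank(x, ties.method = "first")))

# Both approaches produce the same index for every row
stopifnot(all(seq_idx == rank_idx))
```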

Hope this helps.

Petr Savicky.