how to use 'which' inside of 'apply'?

data.frames are quite efficient when you use
a column at a time, but not when used a row
at a time.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
-----Original Message-----
From: Nathan Piekielek [mailto:npiekielek at gmail.com]
Sent: Monday, October 17, 2011 12:37 PM
To: William Dunlap
Subject: RE: [R] how to use 'which' inside of 'apply'?

Wow, that dramatically improves performance. I never realized data frames
were that inefficient.

Thanks for your help, incredible difference.

Nathan

-----Original Message-----
From: William Dunlap [mailto:wdunlap at tibco.com]
Sent: Monday, October 17, 2011 1:25 PM
To: Nathan Piekielek
Cc: r-help at r-project.org
Subject: RE: [R] how to use 'which' inside of 'apply'?

Your original code works far faster when the input
is a matrix that when it is a data.frame.  Selecting
a row from a data.frame is a very slow operation,
selecting a row from a matrix is quick.  Modifying
a row or a single element in a data.frame is even
worse compared to do it on a matrix.  I compared your
original code:

f0 <- function (df)
{
    for (i in seq_len(nrow(df))) {
        t = which(df[i, 2:24] > df[i, 25])
        r = min(t)
        df[i, 26] = (r - 1) * 16 + 1
    }
    df[, 26] # for now just return the computed column
}

to one that converts relevant parts of the data.frame
df to matrices or vectors before doing the loop over
rows:

f0.a <- function (df)
{
    thold <- df[, "thold"]
    tmp <- as.matrix(df[,2:24])
    ans <- df[,26]
    for (i in seq_len(nrow(df))) {
        t = which(tmp[i,]>thold[i])
        r = min(t)
        ans[i] = (r-1)*16+1
    }
    # df[,26] <- ans
    ans
}

On a 10,000 row data.frame f0 took 47.950 seconds and f0.a took
0.140 seconds.  (f1, below, took 0.012 seconds.)

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

-----Original Message-----
From: William Dunlap
Sent: Monday, October 17, 2011 11:00 AM
To: Nathan Piekielek
Cc: r-help at r-project.org
Subject: RE: [R] how to use 'which' inside of 'apply'?

Try vectorizing it a bit by looping over the columns.
E.g.,

  f1 <- function (df)
  {
      # loop (backwards) over all columns in df whose
      # names start with "D" to find the earliest one
      # that is bigger than column "thold".  I tested with
      # df being a data.frame but a matrix should work too.
      i <- rep(NA_character_, nrow(df))
      colNames <- grep(value = TRUE, "^D", colnames(df))
      for (colName in rev(colNames)) {
          i[df[, colName] > df[, "thold"]] <- colName
      }
      # convert column name "D<number>" to <number>.
      doy <- as.numeric(sub("^D", "", i))
      doy
  }
  > f1(a)
  [1] 129 145 129 177 177 177

You could also try looping over rows with something like
findInterval.  If there are far fewer columns than rows
then looping over columns is generally faster.

Your sample data.frame I called 'a' and in copy-and-pastable
form (from dput()) is
a <- structure(list(pt = c(39177L, 39178L, 39164L, 39143L, 39144L,
39146L), D1 = c(0L, 0L, 0L, 0L, 0L, 0L), D17 = c(0L, 0L, 0L,
0L, 0L, 0L), D33 = c(0L, 0L, 0L, 0L, 0L, 0L), D49 = c(0L, 0L,
0L, 0L, 0L, 0L), D65 = c(0L, 0L, 0L, 0L, 0L, 0L), D81 = c(0L,
0L, 0L, 0L, 0L, 0L), D97 = c(0L, 0L, 0L, 0L, 0L, 0L), D113 = c(0L,
0L, 0L, 0L, 0L, 0L), D129 = c(0.4336, 0.342, 0.483, 0.3088, 0.339,
0.4232), D145 = c(0.4754, 0.4543, 0.4943, 0.3753, 0.4152, 0.4442
), D161 = c(0.5340667, 0.5397666, 0.5740333, 0.4466, 0.5147,
0.5084), D177 = c(0.5927334, 0.6252333, 0.6537667, 0.5179, 0.6142,
0.5726), D193 = c(0.6514, 0.7107, 0.7335, 0.5892, 0.7137, 0.6368
), D209 = c(0.6966, 0.7123, 0.6255, 0.6468, 0.6914, 0.5896),
    D225 = c(0.59, 0.5591, 0.6228, 0.4794, 0.6381, 0.4703), D241 =
c(0.5583,
    0.4617, 0.5255, 0.4411, 0.5704, 0.4936), D257 = c(0.5676,
    0.4206, 0.5436, 0.4307, 0.5619, 0.5353), D273 = c(0.4682,
    0.3867, 0.5541, 0.3632, 0.5347, 0.4067), D289 = c(0.35115,
    0.2578, 0.46195, 0.34355, 0.4976, 0.39685), D305 = c(0.2341,
    0.1289, 0.3698, 0.3239, 0.4605, 0.387), D321 = c(0.11705,
    0, 0.1849, 0, 0, 0), D337 = c(0L, 0L, 0L, 0L, 0L, 0L), D353 = c(0L,
    0L, 0L, 0L, 0L, 0L), thold = c(0.406825, 0.4206, 0.4592,
    0.4778, 0.52635, 0.5119), doy = c(0L, 0L, 0L, 0L, 0L, 0L)), .Names =
c("pt",
"D1", "D17", "D33", "D49", "D65", "D81", "D97", "D113", "D129",
"D145", "D161", "D177", "D193", "D209", "D225", "D241", "D257",
"D273", "D289", "D305", "D321", "D337", "D353", "thold", "doy"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5",
"6"))

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
On Behalf Of R. Michael
Weylandt
Sent: Monday, October 17, 2011 10:32 AM
To: Nathan Piekielek
Cc: r-help at r-project.org
Subject: Re: [R] how to use 'which' inside of 'apply'?

I think something like this should do it at a huge speed up, though
I'd advise you check it to make sure it does exactly what you want:
there's also nothing to guarantee that something beats the threshold,
so that might make the whole thing fall apart (though I don't think it
will)

# Sample data
df = data.frame(x = sample(5, 15,T),
			y = sample(5, 15, T),
			z = sample(5, 15,T),
			w = (1:5)/2 + 0.5,
			th = (1:5)/2,
			doy = rep(0,15))

wd <- which(df[,1:4] > df[,5], arr.ind = TRUE)
# identify all elements that beat the threshold value by their indices

wd <- wd[!duplicated(wd[,1]),]
# select only the first appearance of each "row" value in wd -- this
keeps the earliest column beating the threshold

wd <- wd[order(wd[,"row"]),]
# sort them by row

df$doy = (wd[,"col"]-1)*16 + 1
# The column transform you used.

Hope this helps,

Michael

On Mon, Oct 17, 2011 at 1:03 PM, Nathan Piekielek <npiekielek at gmail.com>
wrote:
Hello R-community,

I am trying to populate a column (doy) in a large dataset with the
first
column number that exceeds the value in another column (thold) using
the
'apply' function.

Sample data:
? ? pt D1 D17 D33 D49 D65 D81 D97 D113 ? D129 ? D145 ? ? ?D161
?D177
D193 ? D209 ? D225 ? D241 ? D257
1 39177 ?0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? ?0 0.4336 0.4754 0.5340667
0.5927334
0.6514 0.6966 0.5900 0.5583 0.5676
2 39178 ?0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? ?0 0.3420 0.4543 0.5397666
0.6252333
0.7107 0.7123 0.5591 0.4617 0.4206
3 39164 ?0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? ?0 0.4830 0.4943 0.5740333
0.6537667
0.7335 0.6255 0.6228 0.5255 0.5436
4 39143 ?0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? ?0 0.3088 0.3753 0.4466000
0.5179000
0.5892 0.6468 0.4794 0.4411 0.4307
5 39144 ?0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? ?0 0.3390 0.4152 0.5147000
0.6142000
0.7137 0.6914 0.6381 0.5704 0.5619
6 39146 ?0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? ?0 0.4232 0.4442 0.5084000
0.5726000
0.6368 0.5896 0.4703 0.4936 0.5353
? ?D273 ? ?D289 ? D305 ? ?D321 D337 D353 ? ?thold doy
1 0.4682 0.35115 0.2341 0.11705 ? ?0 ? ?0 0.406825 ? 0
2 0.3867 0.25780 0.1289 0.00000 ? ?0 ? ?0 0.420600 ? 0
3 0.5541 0.46195 0.3698 0.18490 ? ?0 ? ?0 0.459200 ? 0
4 0.3632 0.34355 0.3239 0.00000 ? ?0 ? ?0 0.477800 ? 0
5 0.5347 0.49760 0.4605 0.00000 ? ?0 ? ?0 0.526350 ? 0
6 0.4067 0.39685 0.3870 0.00000 ? ?0 ? ?0 0.511900 ? 0

For the first record in above example I would expect doy = 129.

I can achieve this with the following loop, but it takes several days
to run
and there must be a more efficient solution:

for (i in (1:152000)) {
t=which(data[i,2:24]>data[i,25])
r=min(t)
data[i,26]=(r-1)*16+1
}

How do I write this using 'apply' or another function that will be
more
efficient?

I have tried the following:
data$doy=apply(which(data[,2:24]>data[,25]),1,min)

Which returns the following error message:
"Error in apply(which(new[, 2:24] > new[, 25]), 1, min) :
?dim(X) must have a positive length"

Any help would be much appreciated.

Nathan

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.