Skip to content

how to use 'which' inside of 'apply'?

4 messages · Nathan Piekielek, R. Michael Weylandt, William Dunlap

#
Hello R-community,

I am trying to populate a column (doy) in a large dataset with the first
column number that exceeds the value in another column (thold) using the
'apply' function.

Sample data:
     pt D1 D17 D33 D49 D65 D81 D97 D113   D129   D145      D161      D177
D193   D209   D225   D241   D257
1 39177  0   0   0   0   0   0   0    0 0.4336 0.4754 0.5340667 0.5927334
0.6514 0.6966 0.5900 0.5583 0.5676
2 39178  0   0   0   0   0   0   0    0 0.3420 0.4543 0.5397666 0.6252333
0.7107 0.7123 0.5591 0.4617 0.4206
3 39164  0   0   0   0   0   0   0    0 0.4830 0.4943 0.5740333 0.6537667
0.7335 0.6255 0.6228 0.5255 0.5436
4 39143  0   0   0   0   0   0   0    0 0.3088 0.3753 0.4466000 0.5179000
0.5892 0.6468 0.4794 0.4411 0.4307
5 39144  0   0   0   0   0   0   0    0 0.3390 0.4152 0.5147000 0.6142000
0.7137 0.6914 0.6381 0.5704 0.5619
6 39146  0   0   0   0   0   0   0    0 0.4232 0.4442 0.5084000 0.5726000
0.6368 0.5896 0.4703 0.4936 0.5353
    D273    D289   D305    D321 D337 D353    thold doy
1 0.4682 0.35115 0.2341 0.11705    0    0 0.406825   0
2 0.3867 0.25780 0.1289 0.00000    0    0 0.420600   0
3 0.5541 0.46195 0.3698 0.18490    0    0 0.459200   0
4 0.3632 0.34355 0.3239 0.00000    0    0 0.477800   0
5 0.5347 0.49760 0.4605 0.00000    0    0 0.526350   0
6 0.4067 0.39685 0.3870 0.00000    0    0 0.511900   0

For the first record in above example I would expect doy = 129.

I can achieve this with the following loop, but it takes several days to run
and there must be a more efficient solution:

for (i in (1:152000)) {
t=which(data[i,2:24]>data[i,25])
r=min(t)
data[i,26]=(r-1)*16+1
}

How do I write this using 'apply' or another function that will be more
efficient? 

I have tried the following:
data$doy=apply(which(data[,2:24]>data[,25]),1,min)

Which returns the following error message:
"Error in apply(which(new[, 2:24] > new[, 25]), 1, min) : 
  dim(X) must have a positive length"

Any help would be much appreciated.

Nathan
#
I think something like this should do it at a huge speed up, though
I'd advise you check it to make sure it does exactly what you want:
there's also nothing to guarantee that something beats the threshold,
so that might make the whole thing fall apart (though I don't think it
will)

# Sample data
df = data.frame(x = sample(5, 15,T),
			y = sample(5, 15, T),
			z = sample(5, 15,T),
			w = (1:5)/2 + 0.5,
			th = (1:5)/2,
			doy = rep(0,15))

wd <- which(df[,1:4] > df[,5], arr.ind = TRUE)
# identify all elements that beat the threshold value by their indices

wd <- wd[!duplicated(wd[,1]),]
# select only the first appearance of each "row" value in wd -- this
keeps the earliest column beating the threshold

wd <- wd[order(wd[,"row"]),]
# sort them by row

df$doy = (wd[,"col"]-1)*16 + 1
# The column transform you used.

Hope this helps,

Michael
On Mon, Oct 17, 2011 at 1:03 PM, Nathan Piekielek <npiekielek at gmail.com> wrote:
#
Try vectorizing it a bit by looping over the columns.
E.g.,

  f1 <- function (df)
  {
      # loop (backwards) over all columns in df whose
      # names start with "D" to find the earliest one
      # that is bigger than column "thold".  I tested with
      # df being a data.frame but a matrix should work too.
      i <- rep(NA_character_, nrow(df))
      colNames <- grep(value = TRUE, "^D", colnames(df))
      for (colName in rev(colNames)) {
          i[df[, colName] > df[, "thold"]] <- colName
      }
      # convert column name "D<number>" to <number>.
      doy <- as.numeric(sub("^D", "", i))
      doy
  }
  > f1(a)
  [1] 129 145 129 177 177 177

You could also try looping over rows with something like
findInterval.  If there are far fewer columns than rows
then looping over columns is generally faster.

Your sample data.frame I called 'a' and in copy-and-pastable
form (from dput()) is
a <- structure(list(pt = c(39177L, 39178L, 39164L, 39143L, 39144L,
39146L), D1 = c(0L, 0L, 0L, 0L, 0L, 0L), D17 = c(0L, 0L, 0L,
0L, 0L, 0L), D33 = c(0L, 0L, 0L, 0L, 0L, 0L), D49 = c(0L, 0L,
0L, 0L, 0L, 0L), D65 = c(0L, 0L, 0L, 0L, 0L, 0L), D81 = c(0L,
0L, 0L, 0L, 0L, 0L), D97 = c(0L, 0L, 0L, 0L, 0L, 0L), D113 = c(0L,
0L, 0L, 0L, 0L, 0L), D129 = c(0.4336, 0.342, 0.483, 0.3088, 0.339,
0.4232), D145 = c(0.4754, 0.4543, 0.4943, 0.3753, 0.4152, 0.4442
), D161 = c(0.5340667, 0.5397666, 0.5740333, 0.4466, 0.5147,
0.5084), D177 = c(0.5927334, 0.6252333, 0.6537667, 0.5179, 0.6142,
0.5726), D193 = c(0.6514, 0.7107, 0.7335, 0.5892, 0.7137, 0.6368
), D209 = c(0.6966, 0.7123, 0.6255, 0.6468, 0.6914, 0.5896),
    D225 = c(0.59, 0.5591, 0.6228, 0.4794, 0.6381, 0.4703), D241 = c(0.5583,
    0.4617, 0.5255, 0.4411, 0.5704, 0.4936), D257 = c(0.5676,
    0.4206, 0.5436, 0.4307, 0.5619, 0.5353), D273 = c(0.4682,
    0.3867, 0.5541, 0.3632, 0.5347, 0.4067), D289 = c(0.35115,
    0.2578, 0.46195, 0.34355, 0.4976, 0.39685), D305 = c(0.2341,
    0.1289, 0.3698, 0.3239, 0.4605, 0.387), D321 = c(0.11705,
    0, 0.1849, 0, 0, 0), D337 = c(0L, 0L, 0L, 0L, 0L, 0L), D353 = c(0L,
    0L, 0L, 0L, 0L, 0L), thold = c(0.406825, 0.4206, 0.4592,
    0.4778, 0.52635, 0.5119), doy = c(0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("pt",
"D1", "D17", "D33", "D49", "D65", "D81", "D97", "D113", "D129",
"D145", "D161", "D177", "D193", "D209", "D225", "D241", "D257",
"D273", "D289", "D305", "D321", "D337", "D353", "thold", "doy"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5",
"6"))


Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
Your original code works far faster when the input
is a matrix that when it is a data.frame.  Selecting
a row from a data.frame is a very slow operation,
selecting a row from a matrix is quick.  Modifying
a row or a single element in a data.frame is even
worse compared to do it on a matrix.  I compared your
original code:

f0 <- function (df)
{
    for (i in seq_len(nrow(df))) {
        t = which(df[i, 2:24] > df[i, 25])
        r = min(t)
        df[i, 26] = (r - 1) * 16 + 1
    }
    df[, 26] # for now just return the computed column
}

to one that converts relevant parts of the data.frame
df to matrices or vectors before doing the loop over
rows:

f0.a <- function (df)
{
    thold <- df[, "thold"]
    tmp <- as.matrix(df[,2:24])
    ans <- df[,26]
    for (i in seq_len(nrow(df))) {
        t = which(tmp[i,]>thold[i])
        r = min(t)
        ans[i] = (r-1)*16+1
    }
    # df[,26] <- ans
    ans
}

On a 10,000 row data.frame f0 took 47.950 seconds and f0.a took
0.140 seconds.  (f1, below, took 0.012 seconds.)



Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com