Hello R-community,
I am trying to populate a column (doy) in a large dataset with the first
column number that exceeds the value in another column (thold) using the
'apply' function.
Sample data:
pt D1 D17 D33 D49 D65 D81 D97 D113 D129 D145 D161 D177
D193 D209 D225 D241 D257
1 39177 0 0 0 0 0 0 0 0 0.4336 0.4754 0.5340667 0.5927334
0.6514 0.6966 0.5900 0.5583 0.5676
2 39178 0 0 0 0 0 0 0 0 0.3420 0.4543 0.5397666 0.6252333
0.7107 0.7123 0.5591 0.4617 0.4206
3 39164 0 0 0 0 0 0 0 0 0.4830 0.4943 0.5740333 0.6537667
0.7335 0.6255 0.6228 0.5255 0.5436
4 39143 0 0 0 0 0 0 0 0 0.3088 0.3753 0.4466000 0.5179000
0.5892 0.6468 0.4794 0.4411 0.4307
5 39144 0 0 0 0 0 0 0 0 0.3390 0.4152 0.5147000 0.6142000
0.7137 0.6914 0.6381 0.5704 0.5619
6 39146 0 0 0 0 0 0 0 0 0.4232 0.4442 0.5084000 0.5726000
0.6368 0.5896 0.4703 0.4936 0.5353
D273 D289 D305 D321 D337 D353 thold doy
1 0.4682 0.35115 0.2341 0.11705 0 0 0.406825 0
2 0.3867 0.25780 0.1289 0.00000 0 0 0.420600 0
3 0.5541 0.46195 0.3698 0.18490 0 0 0.459200 0
4 0.3632 0.34355 0.3239 0.00000 0 0 0.477800 0
5 0.5347 0.49760 0.4605 0.00000 0 0 0.526350 0
6 0.4067 0.39685 0.3870 0.00000 0 0 0.511900 0
For the first record in above example I would expect doy = 129.
I can achieve this with the following loop, but it takes several days to run
and there must be a more efficient solution:
for (i in (1:152000)) {
t=which(data[i,2:24]>data[i,25])
r=min(t)
data[i,26]=(r-1)*16+1
}
How do I write this using 'apply' or another function that will be more
efficient?
I have tried the following:
data$doy=apply(which(data[,2:24]>data[,25]),1,min)
Which returns the following error message:
"Error in apply(which(new[, 2:24] > new[, 25]), 1, min) :
dim(X) must have a positive length"
Any help would be much appreciated.
Nathan
how to use 'which' inside of 'apply'?
4 messages · Nathan Piekielek, R. Michael Weylandt, William Dunlap
I think something like this should do it at a huge speed up, though I'd advise you check it to make sure it does exactly what you want: there's also nothing to guarantee that something beats the threshold, so that might make the whole thing fall apart (though I don't think it will) # Sample data df = data.frame(x = sample(5, 15,T), y = sample(5, 15, T), z = sample(5, 15,T), w = (1:5)/2 + 0.5, th = (1:5)/2, doy = rep(0,15)) wd <- which(df[,1:4] > df[,5], arr.ind = TRUE) # identify all elements that beat the threshold value by their indices wd <- wd[!duplicated(wd[,1]),] # select only the first appearance of each "row" value in wd -- this keeps the earliest column beating the threshold wd <- wd[order(wd[,"row"]),] # sort them by row df$doy = (wd[,"col"]-1)*16 + 1 # The column transform you used. Hope this helps, Michael
On Mon, Oct 17, 2011 at 1:03 PM, Nathan Piekielek <npiekielek at gmail.com> wrote:
Hello R-community,
I am trying to populate a column (doy) in a large dataset with the first
column number that exceeds the value in another column (thold) using the
'apply' function.
Sample data:
? ? pt D1 D17 D33 D49 D65 D81 D97 D113 ? D129 ? D145 ? ? ?D161 ? ? ?D177
D193 ? D209 ? D225 ? D241 ? D257
1 39177 ?0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? ?0 0.4336 0.4754 0.5340667 0.5927334
0.6514 0.6966 0.5900 0.5583 0.5676
2 39178 ?0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? ?0 0.3420 0.4543 0.5397666 0.6252333
0.7107 0.7123 0.5591 0.4617 0.4206
3 39164 ?0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? ?0 0.4830 0.4943 0.5740333 0.6537667
0.7335 0.6255 0.6228 0.5255 0.5436
4 39143 ?0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? ?0 0.3088 0.3753 0.4466000 0.5179000
0.5892 0.6468 0.4794 0.4411 0.4307
5 39144 ?0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? ?0 0.3390 0.4152 0.5147000 0.6142000
0.7137 0.6914 0.6381 0.5704 0.5619
6 39146 ?0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? ?0 0.4232 0.4442 0.5084000 0.5726000
0.6368 0.5896 0.4703 0.4936 0.5353
? ?D273 ? ?D289 ? D305 ? ?D321 D337 D353 ? ?thold doy
1 0.4682 0.35115 0.2341 0.11705 ? ?0 ? ?0 0.406825 ? 0
2 0.3867 0.25780 0.1289 0.00000 ? ?0 ? ?0 0.420600 ? 0
3 0.5541 0.46195 0.3698 0.18490 ? ?0 ? ?0 0.459200 ? 0
4 0.3632 0.34355 0.3239 0.00000 ? ?0 ? ?0 0.477800 ? 0
5 0.5347 0.49760 0.4605 0.00000 ? ?0 ? ?0 0.526350 ? 0
6 0.4067 0.39685 0.3870 0.00000 ? ?0 ? ?0 0.511900 ? 0
For the first record in above example I would expect doy = 129.
I can achieve this with the following loop, but it takes several days to run
and there must be a more efficient solution:
for (i in (1:152000)) {
t=which(data[i,2:24]>data[i,25])
r=min(t)
data[i,26]=(r-1)*16+1
}
How do I write this using 'apply' or another function that will be more
efficient?
I have tried the following:
data$doy=apply(which(data[,2:24]>data[,25]),1,min)
Which returns the following error message:
"Error in apply(which(new[, 2:24] > new[, 25]), 1, min) :
?dim(X) must have a positive length"
Any help would be much appreciated.
Nathan
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Try vectorizing it a bit by looping over the columns.
E.g.,
f1 <- function (df)
{
# loop (backwards) over all columns in df whose
# names start with "D" to find the earliest one
# that is bigger than column "thold". I tested with
# df being a data.frame but a matrix should work too.
i <- rep(NA_character_, nrow(df))
colNames <- grep(value = TRUE, "^D", colnames(df))
for (colName in rev(colNames)) {
i[df[, colName] > df[, "thold"]] <- colName
}
# convert column name "D<number>" to <number>.
doy <- as.numeric(sub("^D", "", i))
doy
}
> f1(a)
[1] 129 145 129 177 177 177
You could also try looping over rows with something like
findInterval. If there are far fewer columns than rows
then looping over columns is generally faster.
Your sample data.frame I called 'a' and in copy-and-pastable
form (from dput()) is
a <- structure(list(pt = c(39177L, 39178L, 39164L, 39143L, 39144L,
39146L), D1 = c(0L, 0L, 0L, 0L, 0L, 0L), D17 = c(0L, 0L, 0L,
0L, 0L, 0L), D33 = c(0L, 0L, 0L, 0L, 0L, 0L), D49 = c(0L, 0L,
0L, 0L, 0L, 0L), D65 = c(0L, 0L, 0L, 0L, 0L, 0L), D81 = c(0L,
0L, 0L, 0L, 0L, 0L), D97 = c(0L, 0L, 0L, 0L, 0L, 0L), D113 = c(0L,
0L, 0L, 0L, 0L, 0L), D129 = c(0.4336, 0.342, 0.483, 0.3088, 0.339,
0.4232), D145 = c(0.4754, 0.4543, 0.4943, 0.3753, 0.4152, 0.4442
), D161 = c(0.5340667, 0.5397666, 0.5740333, 0.4466, 0.5147,
0.5084), D177 = c(0.5927334, 0.6252333, 0.6537667, 0.5179, 0.6142,
0.5726), D193 = c(0.6514, 0.7107, 0.7335, 0.5892, 0.7137, 0.6368
), D209 = c(0.6966, 0.7123, 0.6255, 0.6468, 0.6914, 0.5896),
D225 = c(0.59, 0.5591, 0.6228, 0.4794, 0.6381, 0.4703), D241 = c(0.5583,
0.4617, 0.5255, 0.4411, 0.5704, 0.4936), D257 = c(0.5676,
0.4206, 0.5436, 0.4307, 0.5619, 0.5353), D273 = c(0.4682,
0.3867, 0.5541, 0.3632, 0.5347, 0.4067), D289 = c(0.35115,
0.2578, 0.46195, 0.34355, 0.4976, 0.39685), D305 = c(0.2341,
0.1289, 0.3698, 0.3239, 0.4605, 0.387), D321 = c(0.11705,
0, 0.1849, 0, 0, 0), D337 = c(0L, 0L, 0L, 0L, 0L, 0L), D353 = c(0L,
0L, 0L, 0L, 0L, 0L), thold = c(0.406825, 0.4206, 0.4592,
0.4778, 0.52635, 0.5119), doy = c(0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("pt",
"D1", "D17", "D33", "D49", "D65", "D81", "D97", "D113", "D129",
"D145", "D161", "D177", "D193", "D209", "D225", "D241", "D257",
"D273", "D289", "D305", "D321", "D337", "D353", "thold", "doy"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5",
"6"))
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of R. Michael Weylandt Sent: Monday, October 17, 2011 10:32 AM To: Nathan Piekielek Cc: r-help at r-project.org Subject: Re: [R] how to use 'which' inside of 'apply'? I think something like this should do it at a huge speed up, though I'd advise you check it to make sure it does exactly what you want: there's also nothing to guarantee that something beats the threshold, so that might make the whole thing fall apart (though I don't think it will) # Sample data df = data.frame(x = sample(5, 15,T), y = sample(5, 15, T), z = sample(5, 15,T), w = (1:5)/2 + 0.5, th = (1:5)/2, doy = rep(0,15)) wd <- which(df[,1:4] > df[,5], arr.ind = TRUE) # identify all elements that beat the threshold value by their indices wd <- wd[!duplicated(wd[,1]),] # select only the first appearance of each "row" value in wd -- this keeps the earliest column beating the threshold wd <- wd[order(wd[,"row"]),] # sort them by row df$doy = (wd[,"col"]-1)*16 + 1 # The column transform you used. Hope this helps, Michael On Mon, Oct 17, 2011 at 1:03 PM, Nathan Piekielek <npiekielek at gmail.com> wrote:
Hello R-community,
I am trying to populate a column (doy) in a large dataset with the first
column number that exceeds the value in another column (thold) using the
'apply' function.
Sample data:
? ? pt D1 D17 D33 D49 D65 D81 D97 D113 ? D129 ? D145 ? ? ?D161 ? ? ?D177
D193 ? D209 ? D225 ? D241 ? D257
1 39177 ?0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? ?0 0.4336 0.4754 0.5340667 0.5927334
0.6514 0.6966 0.5900 0.5583 0.5676
2 39178 ?0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? ?0 0.3420 0.4543 0.5397666 0.6252333
0.7107 0.7123 0.5591 0.4617 0.4206
3 39164 ?0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? ?0 0.4830 0.4943 0.5740333 0.6537667
0.7335 0.6255 0.6228 0.5255 0.5436
4 39143 ?0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? ?0 0.3088 0.3753 0.4466000 0.5179000
0.5892 0.6468 0.4794 0.4411 0.4307
5 39144 ?0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? ?0 0.3390 0.4152 0.5147000 0.6142000
0.7137 0.6914 0.6381 0.5704 0.5619
6 39146 ?0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? ?0 0.4232 0.4442 0.5084000 0.5726000
0.6368 0.5896 0.4703 0.4936 0.5353
? ?D273 ? ?D289 ? D305 ? ?D321 D337 D353 ? ?thold doy
1 0.4682 0.35115 0.2341 0.11705 ? ?0 ? ?0 0.406825 ? 0
2 0.3867 0.25780 0.1289 0.00000 ? ?0 ? ?0 0.420600 ? 0
3 0.5541 0.46195 0.3698 0.18490 ? ?0 ? ?0 0.459200 ? 0
4 0.3632 0.34355 0.3239 0.00000 ? ?0 ? ?0 0.477800 ? 0
5 0.5347 0.49760 0.4605 0.00000 ? ?0 ? ?0 0.526350 ? 0
6 0.4067 0.39685 0.3870 0.00000 ? ?0 ? ?0 0.511900 ? 0
For the first record in above example I would expect doy = 129.
I can achieve this with the following loop, but it takes several days to run
and there must be a more efficient solution:
for (i in (1:152000)) {
t=which(data[i,2:24]>data[i,25])
r=min(t)
data[i,26]=(r-1)*16+1
}
How do I write this using 'apply' or another function that will be more
efficient?
I have tried the following:
data$doy=apply(which(data[,2:24]>data[,25]),1,min)
Which returns the following error message:
"Error in apply(which(new[, 2:24] > new[, 25]), 1, min) :
?dim(X) must have a positive length"
Any help would be much appreciated.
Nathan
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Your original code works far faster when the input
is a matrix that when it is a data.frame. Selecting
a row from a data.frame is a very slow operation,
selecting a row from a matrix is quick. Modifying
a row or a single element in a data.frame is even
worse compared to do it on a matrix. I compared your
original code:
f0 <- function (df)
{
for (i in seq_len(nrow(df))) {
t = which(df[i, 2:24] > df[i, 25])
r = min(t)
df[i, 26] = (r - 1) * 16 + 1
}
df[, 26] # for now just return the computed column
}
to one that converts relevant parts of the data.frame
df to matrices or vectors before doing the loop over
rows:
f0.a <- function (df)
{
thold <- df[, "thold"]
tmp <- as.matrix(df[,2:24])
ans <- df[,26]
for (i in seq_len(nrow(df))) {
t = which(tmp[i,]>thold[i])
r = min(t)
ans[i] = (r-1)*16+1
}
# df[,26] <- ans
ans
}
On a 10,000 row data.frame f0 took 47.950 seconds and f0.a took
0.140 seconds. (f1, below, took 0.012 seconds.)
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
-----Original Message-----
From: William Dunlap
Sent: Monday, October 17, 2011 11:00 AM
To: Nathan Piekielek
Cc: r-help at r-project.org
Subject: RE: [R] how to use 'which' inside of 'apply'?
Try vectorizing it a bit by looping over the columns.
E.g.,
f1 <- function (df)
{
# loop (backwards) over all columns in df whose
# names start with "D" to find the earliest one
# that is bigger than column "thold". I tested with
# df being a data.frame but a matrix should work too.
i <- rep(NA_character_, nrow(df))
colNames <- grep(value = TRUE, "^D", colnames(df))
for (colName in rev(colNames)) {
i[df[, colName] > df[, "thold"]] <- colName
}
# convert column name "D<number>" to <number>.
doy <- as.numeric(sub("^D", "", i))
doy
}
> f1(a)
[1] 129 145 129 177 177 177
You could also try looping over rows with something like
findInterval. If there are far fewer columns than rows
then looping over columns is generally faster.
Your sample data.frame I called 'a' and in copy-and-pastable
form (from dput()) is
a <- structure(list(pt = c(39177L, 39178L, 39164L, 39143L, 39144L,
39146L), D1 = c(0L, 0L, 0L, 0L, 0L, 0L), D17 = c(0L, 0L, 0L,
0L, 0L, 0L), D33 = c(0L, 0L, 0L, 0L, 0L, 0L), D49 = c(0L, 0L,
0L, 0L, 0L, 0L), D65 = c(0L, 0L, 0L, 0L, 0L, 0L), D81 = c(0L,
0L, 0L, 0L, 0L, 0L), D97 = c(0L, 0L, 0L, 0L, 0L, 0L), D113 = c(0L,
0L, 0L, 0L, 0L, 0L), D129 = c(0.4336, 0.342, 0.483, 0.3088, 0.339,
0.4232), D145 = c(0.4754, 0.4543, 0.4943, 0.3753, 0.4152, 0.4442
), D161 = c(0.5340667, 0.5397666, 0.5740333, 0.4466, 0.5147,
0.5084), D177 = c(0.5927334, 0.6252333, 0.6537667, 0.5179, 0.6142,
0.5726), D193 = c(0.6514, 0.7107, 0.7335, 0.5892, 0.7137, 0.6368
), D209 = c(0.6966, 0.7123, 0.6255, 0.6468, 0.6914, 0.5896),
D225 = c(0.59, 0.5591, 0.6228, 0.4794, 0.6381, 0.4703), D241 = c(0.5583,
0.4617, 0.5255, 0.4411, 0.5704, 0.4936), D257 = c(0.5676,
0.4206, 0.5436, 0.4307, 0.5619, 0.5353), D273 = c(0.4682,
0.3867, 0.5541, 0.3632, 0.5347, 0.4067), D289 = c(0.35115,
0.2578, 0.46195, 0.34355, 0.4976, 0.39685), D305 = c(0.2341,
0.1289, 0.3698, 0.3239, 0.4605, 0.387), D321 = c(0.11705,
0, 0.1849, 0, 0, 0), D337 = c(0L, 0L, 0L, 0L, 0L, 0L), D353 = c(0L,
0L, 0L, 0L, 0L, 0L), thold = c(0.406825, 0.4206, 0.4592,
0.4778, 0.52635, 0.5119), doy = c(0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("pt",
"D1", "D17", "D33", "D49", "D65", "D81", "D97", "D113", "D129",
"D145", "D161", "D177", "D193", "D209", "D225", "D241", "D257",
"D273", "D289", "D305", "D321", "D337", "D353", "thold", "doy"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5",
"6"))
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of R. Michael Weylandt Sent: Monday, October 17, 2011 10:32 AM To: Nathan Piekielek Cc: r-help at r-project.org Subject: Re: [R] how to use 'which' inside of 'apply'? I think something like this should do it at a huge speed up, though I'd advise you check it to make sure it does exactly what you want: there's also nothing to guarantee that something beats the threshold, so that might make the whole thing fall apart (though I don't think it will) # Sample data df = data.frame(x = sample(5, 15,T), y = sample(5, 15, T), z = sample(5, 15,T), w = (1:5)/2 + 0.5, th = (1:5)/2, doy = rep(0,15)) wd <- which(df[,1:4] > df[,5], arr.ind = TRUE) # identify all elements that beat the threshold value by their indices wd <- wd[!duplicated(wd[,1]),] # select only the first appearance of each "row" value in wd -- this keeps the earliest column beating the threshold wd <- wd[order(wd[,"row"]),] # sort them by row df$doy = (wd[,"col"]-1)*16 + 1 # The column transform you used. Hope this helps, Michael On Mon, Oct 17, 2011 at 1:03 PM, Nathan Piekielek <npiekielek at gmail.com> wrote:
Hello R-community,
I am trying to populate a column (doy) in a large dataset with the first
column number that exceeds the value in another column (thold) using the
'apply' function.
Sample data:
? ? pt D1 D17 D33 D49 D65 D81 D97 D113 ? D129 ? D145 ? ? ?D161 ? ? ?D177
D193 ? D209 ? D225 ? D241 ? D257
1 39177 ?0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? ?0 0.4336 0.4754 0.5340667 0.5927334
0.6514 0.6966 0.5900 0.5583 0.5676
2 39178 ?0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? ?0 0.3420 0.4543 0.5397666 0.6252333
0.7107 0.7123 0.5591 0.4617 0.4206
3 39164 ?0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? ?0 0.4830 0.4943 0.5740333 0.6537667
0.7335 0.6255 0.6228 0.5255 0.5436
4 39143 ?0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? ?0 0.3088 0.3753 0.4466000 0.5179000
0.5892 0.6468 0.4794 0.4411 0.4307
5 39144 ?0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? ?0 0.3390 0.4152 0.5147000 0.6142000
0.7137 0.6914 0.6381 0.5704 0.5619
6 39146 ?0 ? 0 ? 0 ? 0 ? 0 ? 0 ? 0 ? ?0 0.4232 0.4442 0.5084000 0.5726000
0.6368 0.5896 0.4703 0.4936 0.5353
? ?D273 ? ?D289 ? D305 ? ?D321 D337 D353 ? ?thold doy
1 0.4682 0.35115 0.2341 0.11705 ? ?0 ? ?0 0.406825 ? 0
2 0.3867 0.25780 0.1289 0.00000 ? ?0 ? ?0 0.420600 ? 0
3 0.5541 0.46195 0.3698 0.18490 ? ?0 ? ?0 0.459200 ? 0
4 0.3632 0.34355 0.3239 0.00000 ? ?0 ? ?0 0.477800 ? 0
5 0.5347 0.49760 0.4605 0.00000 ? ?0 ? ?0 0.526350 ? 0
6 0.4067 0.39685 0.3870 0.00000 ? ?0 ? ?0 0.511900 ? 0
For the first record in above example I would expect doy = 129.
I can achieve this with the following loop, but it takes several days to run
and there must be a more efficient solution:
for (i in (1:152000)) {
t=which(data[i,2:24]>data[i,25])
r=min(t)
data[i,26]=(r-1)*16+1
}
How do I write this using 'apply' or another function that will be more
efficient?
I have tried the following:
data$doy=apply(which(data[,2:24]>data[,25]),1,min)
Which returns the following error message:
"Error in apply(which(new[, 2:24] > new[, 25]), 1, min) :
?dim(X) must have a positive length"
Any help would be much appreciated.
Nathan
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.