Hello, I am stuck with selecting the right rows from a data frame. I think the problem is rather how to select them then how to implement the R code. Consider the following data frame: df <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10), value = c(34,12,23,25,34,42,48,29,30,27)) What I want to achieve is to select 7 rows (values) so that the mean value of those rows are closest to the value of 35 and the remaining 3 rows (values) are closest to 45. However, each value is only allowed to be sampled once! Any ideas, how to achieve that? Cheers -- View this message in context: http://r.789695.n4.nabble.com/select-rows-by-criteria-tp4434812p4434812.html Sent from the R help mailing list archive at Nabble.com.
select rows by criteria
7 messages · syrvn, Petr Savicky, Rui Barradas
Hello, syrvn wrote
Hello, I am stuck with selecting the right rows from a data frame. I think the problem is rather how to select them then how to implement the R code. Consider the following data frame: df <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10), value = c(34,12,23,25,34,42,48,29,30,27)) What I want to achieve is to select 7 rows (values) so that the mean value of those rows are closest to the value of 35 and the remaining 3 rows (values) are closest to 45. However, each value is only allowed to be sampled once! Any ideas, how to achieve that? Cheers
See ?combn It gives all possible combinations as a matrix (default) or list. Then, 'apply'. #--------------------------- # Name changed to 'DF', # 'df' is the R function for the F distribution density # (and a frequent choice for example data in R-help!) # DF <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10), value = c(34,12,23,25,34,42,48,29,30,27)) f <- function(j, v, const) abs(mean(v[j]) - const) inxmat <- with(DF, combn(ID, 7)) meansDist1 <- apply(inxmat, 2, function(jnx) f(jnx, DF$value, 35)) (i1 <- which(meansDist1 == min(meansDist1))) inxmat <- with(DF, combn(ID, 3)) meansDist2 <- apply(inxmat, 2, function(jnx) f(jnx, DF$value, 45)) (i2 <- which(meansDist2 == min(meansDist2))) meansDist3 <- meansDist1 + meansDist2 # Compromise of both criteria? (i3 <- which(meansDist3 == min(meansDist3))) Maybe it's combn(1:10, 3)[, 101] you want, or maybe there's another way to compromise the two criteria. Hope this helps, Rui Barradas -- View this message in context: http://r.789695.n4.nabble.com/select-rows-by-criteria-tp4434812p4435257.html Sent from the R help mailing list archive at Nabble.com.
Sorry, correction: The second index matrix is the matrix of elements not in the first, not another combination, this time 3 out of 10. Change this in my first post
inxmat <- with(DF, combn(ID, 3)) meansDist2 <- apply(inxmat, 2, function(jnx) f(jnx, DF$value, 45)) (i2 <- which(meansDist2 == min(meansDist2)))
to this inxmat2 <- with(DF, apply(inxmat, 2, function(x) setdiff(ID, x))) meansDist2 <- apply(inxmat2, 2, function(jnx) f(jnx, DF$value, 45)) (i2 <- which(meansDist2 == min(meansDist2))) Rui Barradas -- View this message in context: http://r.789695.n4.nabble.com/select-rows-by-criteria-tp4434812p4435408.html Sent from the R help mailing list archive at Nabble.com.
On Thu, Mar 01, 2012 at 04:27:45AM -0800, syrvn wrote:
Hello, I am stuck with selecting the right rows from a data frame. I think the problem is rather how to select them then how to implement the R code. Consider the following data frame: df <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10), value = c(34,12,23,25,34,42,48,29,30,27)) What I want to achieve is to select 7 rows (values) so that the mean value of those rows are closest to the value of 35 and the remaining 3 rows (values) are closest to 45. However, each value is only allowed to be sampled once!
Hi. If some 3 rows have mean close to 45, then they have sum close to 3*45, so the remaining 7 rows have sum close to sum(df$value) - 3*45 # [1] 169 and they have mean close to 169/7 = 24.14286. In other words, the two criteria cannot be optimized together. For this reason, let me choose the criterion on 3 rows. The closest solution may be found as follows. # generate all triples and compute their means tripleMeans <- colMeans(combn(df$value, 3)) # select the index of the triple with mean closest to 35 indClosest <- which.min(abs(tripleMeans - 35)) # generate the indices, which form the closest triple in df$value tripleInd <- combn(1:length(df$value), 3)[, indClosest] tripleInd # [1] 1 3 7 # check the mean of the triple mean(df$value[tripleInd]) # [1] 35 This code constructs all triples. If it is used for k-tuples for a larger k and for a set of n values, its complexity will be proportional to choose(n, k), so it will be large even for moderate n, k. It is hard to provide a significant speed up, since some variants of "knapsack problem", which is NP-complete, may be reduced to your question. Consequently, it is, in general, NP-complete. Hope this helps. Petr Savicky.
On Thu, Mar 01, 2012 at 05:42:48PM +0100, Petr Savicky wrote:
On Thu, Mar 01, 2012 at 04:27:45AM -0800, syrvn wrote:
Hello, I am stuck with selecting the right rows from a data frame. I think the problem is rather how to select them then how to implement the R code. Consider the following data frame: df <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10), value = c(34,12,23,25,34,42,48,29,30,27)) What I want to achieve is to select 7 rows (values) so that the mean value of those rows are closest to the value of 35 and the remaining 3 rows (values) are closest to 45. However, each value is only allowed to be sampled once!
Hi. If some 3 rows have mean close to 45, then they have sum close to 3*45, so the remaining 7 rows have sum close to sum(df$value) - 3*45 # [1] 169 and they have mean close to 169/7 = 24.14286. In other words, the two criteria cannot be optimized together. For this reason, let me choose the criterion on 3 rows. The closest solution may be found as follows. # generate all triples and compute their means tripleMeans <- colMeans(combn(df$value, 3)) # select the index of the triple with mean closest to 35 indClosest <- which.min(abs(tripleMeans - 35))
I am sorry. There should be 45 and not 35. indClosest <- which.min(abs(tripleMeans - 45)) # generate the indices, which form the closest triple in df$value tripleInd <- combn(1:length(df$value), 3)[, indClosest] tripleInd # [1] 1 6 7 # check the mean of the triple mean(df$value[tripleInd]) # [1] 41.33333 Petr Savicky.
On Thu, Mar 01, 2012 at 05:42:48PM +0100, Petr Savicky wrote:
On Thu, Mar 01, 2012 at 04:27:45AM -0800, syrvn wrote:
Hello, I am stuck with selecting the right rows from a data frame. I think the problem is rather how to select them then how to implement the R code. Consider the following data frame: df <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10), value = c(34,12,23,25,34,42,48,29,30,27)) What I want to achieve is to select 7 rows (values) so that the mean value of those rows are closest to the value of 35 and the remaining 3 rows (values) are closest to 45. However, each value is only allowed to be sampled once!
Hi. If some 3 rows have mean close to 45, then they have sum close to 3*45, so the remaining 7 rows have sum close to sum(df$value) - 3*45 # [1] 169 and they have mean close to 169/7 = 24.14286. In other words, the two criteria cannot be optimized together. For this reason, let me choose the criterion on 3 rows. The closest solution may be found as follows. # generate all triples and compute their means tripleMeans <- colMeans(combn(df$value, 3)) # select the index of the triple with mean closest to 35 indClosest <- which.min(abs(tripleMeans - 35)) # generate the indices, which form the closest triple in df$value tripleInd <- combn(1:length(df$value), 3)[, indClosest] tripleInd # [1] 1 3 7 # check the mean of the triple mean(df$value[tripleInd]) # [1] 35 This code constructs all triples. If it is used for k-tuples for a larger k and for a set of n values, its complexity will be proportional to choose(n, k), so it will be large even for moderate n, k. It is hard to provide a significant speed up, since some variants of "knapsack problem", which is NP-complete, may be reduced to your question. Consequently, it is, in general, NP-complete.
Hi. Also this statement requires a correction. It applies to the search of an exact optimum if the numbers in df$value are large. There are efficient algorithms, which find an approximate solution. Also, if the numbers in df$value are integers (or may be rounded to integers after an appropriate scaling), then there is an algorithm, whose complexity is O(k*n*max(df$value)). This may be significantly less than choose(n, k). CRAN task view Optimization and Mathematical Programming http://cran.at.r-project.org/web/views/Optimization.html may suggest also other solutions. Petr Savicky.
Hello, again. Petr Savicky wrote
On Thu, Mar 01, 2012 at 05:42:48PM +0100, Petr Savicky wrote:
On Thu, Mar 01, 2012 at 04:27:45AM -0800, syrvn wrote:
Hello, I am stuck with selecting the right rows from a data frame. I think the problem is rather how to select them then how to implement the R code. Consider the following data frame: df <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10), value = c(34,12,23,25,34,42,48,29,30,27)) What I want to achieve is to select 7 rows (values) so that the mean
value
of those rows are closest to the value of 35 and the remaining 3 rows (values) are closest to 45. However, each value is only allowed to be sampled once!
Hi. If some 3 rows have mean close to 45, then they have sum close to 3*45, so the remaining 7 rows have sum close to sum(df$value) - 3*45 # [1] 169 and they have mean close to 169/7 = 24.14286. In other words, the two criteria cannot be optimized together. For this reason, let me choose the criterion on 3 rows. The closest solution may be found as follows. # generate all triples and compute their means tripleMeans <- colMeans(combn(df$value, 3)) # select the index of the triple with mean closest to 35 indClosest <- which.min(abs(tripleMeans - 35))
I am sorry. There should be 45 and not 35. indClosest <- which.min(abs(tripleMeans - 45)) # generate the indices, which form the closest triple in df$value tripleInd <- combn(1:length(df$value), 3)[, indClosest] tripleInd # [1] 1 6 7 # check the mean of the triple mean(df$value[tripleInd]) # [1] 41.33333 Petr Savicky.
______________________________________________ R-help@ mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
There are two solutions for the 3 rows criterion, 'which.min' only finds one, the first in the order given by 'combn'. (And I've corrected my first post but still with an error) # Forgot to change the index matrix meansDist2 <- apply(inxmat2, 2, function(jnx) f(jnx, DF$value, 45)) # Two solutions (i2 <- which(meansDist2 == min(meansDist2))) inxmat2[, i2] mean(DF$value[inxmat2[, i2][, 1]]) [1] 41.33333 Petr's solution and mine give the same mean value. But use for small values of (n, k) only. Rui Barradas -- View this message in context: http://r.789695.n4.nabble.com/select-rows-by-criteria-tp4434812p4435760.html Sent from the R help mailing list archive at Nabble.com.