
Difficult subset challenge

Hi Noah,

I am unclear whether the 0s should be standardized or not. Since you
want them excluded from the calculation of the mean and SD, I am
assuming you do not want (0 - M) / sigma for them.  If that is the
case, here is an example:


## read in your data
## FYI: providing via dput() would be easier next time
## (text = avoids opening a connection that then has to be closed)
d <- read.table(text = "
code    v1      v2
G1              1.2     2.3
G1              0       2.4
G1              1.4     3.4
G2              2.9     2.3
G2              4.3     4.4", header = TRUE)

## temporary data as a matrix
tmp <- as.matrix(d[-1])
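## index 0s and set to missing
index.0 <- which(tmp == 0, arr.ind = TRUE)
tmp[index.0] <- NA
## scale each column within levels of d$code and bind back into a matrix
## (rbind() returns the groups in level order, so d must be sorted by code)
tmp <- do.call("rbind", by(tmp, d$code, scale))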
## NAs back to 0s
tmp[index.0] <- 0
d[, 2:3] <- tmp
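
If the rows are not already sorted by code, one way around the ordering
issue is ave(), which returns results in the original row order.  This
is only a sketch: scale_skip0 and d.unsorted are made-up names for
illustration, and the data are the same five rows shuffled.

```r
## hypothetical helper: standardize the nonzero values, leave 0s untouched
scale_skip0 <- function(x) {
  keep <- !is.na(x) & x != 0
  x[keep] <- (x[keep] - mean(x[keep])) / sd(x[keep])
  x
}

## same values as above, but with the rows out of order
d.unsorted <- data.frame(
  code = c("G2", "G1", "G1", "G2", "G1"),
  v1 = c(2.9, 1.2, 0, 4.3, 1.4),
  v2 = c(2.3, 2.3, 2.4, 4.4, 3.4))

## ave() applies the helper within each level of code, column by column
d.unsorted[, 2:3] <- lapply(d.unsorted[, 2:3], function(x)
  ave(x, d.unsorted$code, FUN = scale_skip0))
```

This loops over columns rather than passing scale() a whole matrix, so
it may well be slower with many columns.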

If you want the zeros standardized, it will take a slightly different
approach.  The other issue that could come up here is speed, but that
can be very dataset dependent (e.g., what is most efficient for a few
levels of code may not be what is most efficient for many columns).
That said, it would not take much work to create a parallelized
version of what by() is doing, and scale() is already vectorized, so it
works pretty darn fast assuming you pass it a matrix.
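
For the case where the zeros are standardized too, here is one possible
reading, sketched under the assumption that the mean and SD should
still come from the nonzero values only and the 0s then become
(0 - M) / sigma.  The names scale_nonzero and d.all are made up for
illustration.

```r
## hypothetical helper: center and scale every value (0s included)
## using the mean and SD of the nonzero values only
scale_nonzero <- function(x) {
  nz <- x[!is.na(x) & x != 0]
  ## sd() is NA when a group has fewer than two nonzero values
  (x - mean(nz)) / sd(nz)
}

d.all <- data.frame(
  code = c("G1", "G1", "G1", "G2", "G2"),
  v1 = c(1.2, 0, 1.4, 2.9, 4.3),
  v2 = c(2.3, 2.4, 3.4, 2.3, 4.4))

## apply within each level of code, keeping the original row order
d.all[, 2:3] <- lapply(d.all[, 2:3], function(x)
  ave(x, d.all$code, FUN = scale_nonzero))
```

With these data the 0 in v1 ends up far below the other two G1 values,
since it is measured against a mean and SD that ignore it.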

Cheers,

Josh
On Sat, Dec 10, 2011 at 1:44 PM, Noah Silverman <noahsilverman at ucla.edu> wrote: