Hi,
I'm having difficulty coming up with a good way to subset some data to generate statistics.
My data frame has multiple observations by group.
Here is an overly-simplified toy example of the data
==========================
code v1 v2
G1 1.2 2.3
G1 0 2.4
G1 1.4 3.4
G2 2.9 2.3
G2 4.3 4.4
etc..
=========================
I want to normalize the data *by group* for certain variables. But, I want to ignore 0 values when calculating the mean and standard deviation.
What I *want* to do is something like this:
=======================
for (code in unique (d$code) ){
mu <- mean( d[which(d[d$code==code,v1] !=0 ), v1] )
sig <- sd( d[which(d[d$code==code,v1] !=0 ), v1] )
d[which(d[d$code==code,v1] !=0 ), cname] <- (d[which(d[d$code==code,v1] !=0 ), v1] - mu) / sig
}
=======================
My goal, if it isn't apparent, is to replace values with their normalized value. (But, the statistics used for normalization are calculated skipping zero values.)
This doesn't work as the indexing from the which command is relative (1,2,3, etc.)
Suggestions?
--
Noah Silverman
UCLA Department of Statistics
8208 Math Sciences Building
Los Angeles, CA 90095
Difficult subset challenge
2 messages · Noah Silverman, Joshua Wiley
Hi Noah,
I am unclear if the 0s should be standardized or not---I am assuming
since you want them excluded from the calculation of the mean and SD,
you do not want (0 - M) / sigma. If that is the case, here is an
example:
## read in your data
## FYI: providing via dput() would be easier next time
d <- read.table(textConnection("
code v1 v2
G1 1.2 2.3
G1 0 2.4
G1 1.4 3.4
G2 2.9 2.3
G2 4.3 4.4"), header = TRUE)
closeAllConnections()
## temporary data as a matrix
tmp <- as.matrix(d[-1])
## index 0s and set to missing
tmp[index.0 <- which(tmp == 0, arr.ind = TRUE)] <- NA
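## scale by column and d$code and pull back to matrix
## note: rbind() stacks the groups in factor-level order, so this
## assumes the rows of d are already sorted by code (as they are here)
tmp <- do.call("rbind", by(tmp, d$code, scale))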
## NAs back to 0s
tmp[index.0] <- 0
d[, 2:3] <- tmp
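Because the by()/rbind() step above returns rows grouped by code, it relies on the data already being sorted by group. As a rough sketch of an alternative that keeps the original row order, ave() can apply a per-group function and put the results back in place. The helper name scale_nonzero and the shuffled row order are my own assumptions for illustration, not something from the original post:

```r
## toy data from the thread, deliberately NOT sorted by code
d <- data.frame(code = c("G1", "G2", "G1", "G2", "G1"),
                v1   = c(1.2, 2.9, 0,   4.3, 1.4),
                v2   = c(2.3, 2.3, 2.4, 4.4, 3.4))

## hypothetical helper: standardize the nonzero values of x using the
## mean and SD of those nonzero values; zeros (and NAs) are left alone
scale_nonzero <- function(x) {
  nz <- !is.na(x) & x != 0
  x[nz] <- (x[nz] - mean(x[nz])) / sd(x[nz])
  x
}

## apply within each level of code; ave() returns values in the
## original row order, so no sorting is needed
vars <- c("v1", "v2")
d[vars] <- lapply(d[vars], function(col) ave(col, d$code, FUN = scale_nonzero))
```

Note that FUN has to be passed by name, since ave() treats unnamed arguments after the first as grouping variables.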
If you want the zeros standardized, it will take a bit of a different
approach. The other issue that could come up here is speed, but that
can be very dataset dependent (e.g., what is most efficient for
a few levels of code may not be the same as what is efficient for many
columns, etc.). That said, it would not take much work to create a
parallelized version of what by() is doing, and scale() is already
vectorized, so it works pretty darn fast assuming you pass it a matrix.
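For the other case, where the zeros should also be standardized, here is one possible sketch. It assumes the intent is to compute the mean and SD from the nonzero values only and then apply (x - M) / sigma to every value, zeros included. The helper name scale_with_nonzero_stats and the new column name v1_z are made up for the example:

```r
## toy data from the thread
d <- data.frame(code = c("G1", "G1", "G1", "G2", "G2"),
                v1   = c(1.2, 0,   1.4, 2.9, 4.3),
                v2   = c(2.3, 2.4, 3.4, 2.3, 4.4))

## hypothetical helper: mean and SD come from the nonzero values only,
## but every value (including the zeros) is standardized with them
scale_with_nonzero_stats <- function(x) {
  nz <- !is.na(x) & x != 0
  (x - mean(x[nz])) / sd(x[nz])
}

## per-group standardization of v1, stored in a new column
d$v1_z <- ave(d$v1, d$code, FUN = scale_with_nonzero_stats)
```

With this version the zero in G1 ends up far below the other values, because it is measured against a mean and SD that ignored it.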
Cheers,
Josh
On Sat, Dec 10, 2011 at 1:44 PM, Noah Silverman <noahsilverman at ucla.edu> wrote:
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
--
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/