Is aggregate() what I need here?

Thu, Mar 3, 2005 11:46 PM

I'm pretty new to R, and I've been given a script by a user who wants 
some help with it.  I know enough about the way R works to know that 
this is a very inefficient way to do what the user wants (the 
LSB_JOBINDEX stuff is added by me so that this can work on many 
hundreds of input data files as LSF jobs - it's the nested loops I'm 
really interested in):

gtpfile<-paste("genotype_chunk.", sep="", Sys.getenv("LSB_JOBINDEX"))
snps<-read.table(gtpfile,header=TRUE)
exp<-read.table("data.TXT",header=TRUE)
# the above two files have columns for individuals and snps (or genes) 
for rows
# so the next two lines simply transpose these data matrices
tsnps<-t(snps)
texp<-t(exp)
sink(paste("output.", sep="", Sys.getenv("LSB_JOBINDEX")))
#loops below are hardwired for 5 gene-expression levels (some genes 
have two
#probes, and those are treated as separate genes for now) and 100 SNPs.
for (iexp in 1:5){
    for (isnp in 1:100){
    genotype<-factor(tsnps[,isnp])
#     make sure there is more than one genotypic class before doing 
ANOVA
    if (length(unique(genotype))>1)  {
       expression<-texp[,iexp]
       stuff<-anova(lm(expression ~ genotype))
       qq=c(iexp,isnp)
       print (qq)
       print(stuff)
#     plot(factor(tsnps[,isnp]),texp[,iexp], xlab="Genotype", 
ylab="Expression
  level")
    }
}
}
sink()

Eventually, on the full data set, the size of the two data sets being 
related to each other will be much larger - several hundred thousand 
rows in snps, and several hundred rows in exp.

Clearly, the genotype<-factor line is being calculated repeatedly, and 
in the wrong place; it should be done en masse up front once the data 
has been read, and I'm sure both loops could actually be expressed some 
other way.

It seems to me that I ought to be able to calculate all the factors in 
a statement before the iexp loop, presumably by using apply() or 
aggregate(), but apply() seems to coerce the result so the results 
aren't factors any more, and I'm clearly missing something with regard 
to the way aggregate() works; I really don't understand what I'd need 
to set the 'by' parameter to.

I apologise if this is a hopelessly naive question, but I've got both 
the "Introduction to R" and Peter's introductory statistics with R book 
here, and I'm having some difficulty finding the information I'm after. 
"Introduction to R" doesn't mention the word "aggregate" once... :-)

Thanks in advance,

Tim

Dr Tim Cutts
Informatics Systems Group, Wellcome Trust Sanger Institute
GPG: 1024D/E3134233 FE3D 6C73 BBD6 726A A3F5  860B 3CDD 3F56 E313 4233