Does SQL group by have a heavy duty equivalent in R

On Sun, 31 Dec 2006, Farrel Buchinsky wrote:

 I have hundreds of humans who have undergone SNP genotyping at hundreds of
 loci. Some have even undergone the procedure twice or thrice (kind of an
 internal control).

 So obviously I need to find those replications, and confirm that the
 results
 are the same. If there is discordance then I need to address it.
Why not use  duplicated() ?
More specifically:

 	unique( IDs[ duplicated( IDs ) & ! duplicated ( cbind (IDs, SNPs ) ) ] )

gives a list of those IDs for which the SNPs in all replicates of an ID 
are not the same.
For a data.frame with 200 rows of which about 50 are duplicates and 201 
columns finding the (non) duplicates takes little time on my year old AMD 64 
running Windows XP:

 my.dat <- data.frame(ID=rep(1:100, sample(1:3,100,repl=T)))
 snp.dat <- lapply(1:200,function(x) 0:1 )
 snp.frame <- as.data.frame(do.call(cbind,snp.dat))
 my.dat <- cbind( my.dat,snp.frame[sample(nrow(my.dat))%%2+1,])
 system.time( table(duplicated(my.dat)) )
[1] 0.03 0.00 0.03   NA   NA

Finding the non-duplicated rows for which there is at least one replication:

 system.time( which( (!duplicated(my.dat)) & (my.dat$ID %in%
 names(which(table(my.dat$ID)>1)) ) ))
[1] 0.05 0.00 0.05   NA   NA

 I tried to use the aggregate function

 nr.attempts
 <-aggregate(RawSeq$GENOTYPE_ID,list(sample=RawSeq$SAMPLE_ID,assay=RawSeq$ASSAY_ID),length)
 This was simply to figure out how many times the same piece of information
 had been obtained. I ran out of patience. It took beyond forever and
 tapply
 did not perform much better. The reshape package did not help - it implied
 one was out of luck if the data was not numeric. All of my data is
 character
 or factor.

 Instead I used RODBC

 sqlSave(channel,RawSeq)
 to push the table into a Microsoft Access database
 Then a sql query, courtesy of the Microsoft Access Query Wizard a la
 design
 mode.

 SELECT RawSeq.SAMPLE_ID, RawSeq.ASSAY_ID, Min(RawSeq.GENOTYPE_ID) AS
 MinOfGENOTYPE_ID, Max(RawSeq.GENOTYPE_ID) AS MaxOfGENOTYPE_ID, Count(
 RawSeq.rownames) AS CountOfrownames
 FROM RawSeq
 WHERE (((RawSeq.GENOTYPE_ID)<>""))
 GROUP BY RawSeq.SAMPLE_ID, RawSeq.ASSAY_ID
 ORDER BY Count(RawSeq.rownames) DESC;

 This way I could easily use the minimum and maximum values to see if they
 were discordant.
 Microsoft Access handled it with aplomb. I plan to use RODBC to bring the
 result of the SQL query back into R.

 This is the first time I have seen Microsoft Access outpace R.
 Is my observation correct or am I missing something. I would much rather
 perform all data manipulation and analyses in R.

 --
 Farrel Buchinsky

  [[alternative HTML version deleted]]

 ______________________________________________
 R-help at stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.

Charles C. Berry                        (858) 534-2098
                                        Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	         UC San Diego
http://biostat.ucsd.edu/~cberry/         La Jolla, San Diego 92093-0717

Charles C. Berry                        (858) 534-2098
                                          Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	         UC San Diego
http://biostat.ucsd.edu/~cberry/         La Jolla, San Diego 92093-0717

Does SQL group by have a heavy duty equivalent in R

Thread (7 messages)