Does SQL group by have a heavy duty equivalent in R

An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: https://stat.ethz.ch/pipermail/r-help/attachments/20061231/3bfdc727/attachment.pl
nr.attempts
<-aggregate(RawSeq$GENOTYPE_ID,list(sample=RawSeq$SAMPLE_ID,assay=RawSeq$ASSAY_ID),length)
This was simply to figure out how many times the same piece of information
had been obtained. I ran out of patience. It took beyond forever and tapply
did not perform much better. The reshape package did not help - it implied
one was out of luck if the data was not numeric. All of my data is character
or factor.
The reshape package will work if all your data is numeric, or all of
it is character - it doesn't work with a mix.  I will try and make
this more clear in the documentation.
However, depending on the size and structure of your data it may not
be any faster than tapply or aggregate.

Hadley
I converted the whole data frame to character by using
as.matrix

And then using a posting that explained how to get the naming conventions 
back (which had been lost when converting to matrix)

Anything that I did not list with the id's it insisted in including them 
with the measured variables. In other words it would not let me drop.

despite

melted<-melt(BigDF, id=c("SAMPLE_ID","ASSAY_ID"), 
measured=c("GENOTYPE_ID","DESCRIPTION"))

unique(melted$variable)
 [1] CUSTOMER       PROJECT        PLATE          EXPERIMENT     CHIP 
WELL_POSITION  GENOTYPE_ID    DESCRIPTION    ENTRY_OPERATOR
[10] INTERACT       PLATEc
Levels: CUSTOMER PROJECT PLATE EXPERIMENT CHIP WELL_POSITION GENOTYPE_ID 
DESCRIPTION ENTRY_OPERATOR INTERACT PLATEc

I should have only got GENOTYPE_ID    and DESCRIPTION

"hadley wickham" <h.wickham at gmail.com> wrote in message 
news:f8e6ff050612310758p11f96c0dl256ac5b15d11dc2c at mail.gmail.com...
nr.attempts
<-aggregate(RawSeq$GENOTYPE_ID,list(sample=RawSeq$SAMPLE_ID,assay=RawSeq$ASSAY_ID),length)
This was simply to figure out how many times the same piece of 
information
had been obtained. I ran out of patience. It took beyond forever and 
tapply
did not perform much better. The reshape package did not help - it 
implied
one was out of luck if the data was not numeric. All of my data is 
character
or factor.
The reshape package will work if all your data is numeric, or all of
it is character - it doesn't work with a mix.  I will try and make
this more clear in the documentation.
However, depending on the size and structure of your data it may not
be any faster than tapply or aggregate.

Hadley

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide 
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

I converted the whole data frame to character by using
as.matrix
You shouldn't need to do that.
And then using a posting that explained how to get the naming conventions
back (which had been lost when converting to matrix)

Anything that I did not list with the id's it insisted in including them
with the measured variables. In other words it would not let me drop.

despite

melted<-melt(BigDF, id=c("SAMPLE_ID","ASSAY_ID"),
measured=c("GENOTYPE_ID","DESCRIPTION"))
That should be measure=c(...)

Hadley
An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: https://stat.ethz.ch/pipermail/r-help/attachments/20061231/70c12810/attachment.pl

I have hundreds of humans who have undergone SNP genotyping at hundreds of
loci. Some have even undergone the procedure twice or thrice (kind of an
internal control).

So obviously I need to find those replications, and confirm that the results
are the same. If there is discordance then I need to address it.
Why not use  duplicated() ?

For a data.frame with 200 rows of which about 50 are duplicates and 201 
columns finding the (non) duplicates takes little time on my year old AMD 
64 running Windows XP:
my.dat <- data.frame(ID=rep(1:100, sample(1:3,100,repl=T)))
snp.dat <- lapply(1:200,function(x) 0:1 )
snp.frame <- as.data.frame(do.call(cbind,snp.dat))
my.dat <- cbind( my.dat,snp.frame[sample(nrow(my.dat))%%2+1,])
system.time( table(duplicated(my.dat)) )
[1] 0.03 0.00 0.03   NA   NA

Finding the non-duplicated rows for which there is at least one 
replication:
system.time( which( (!duplicated(my.dat)) & (my.dat$ID %in% names(which(table(my.dat$ID)>1)) ) ))
[1] 0.05 0.00 0.05   NA   NA

I tried to use the aggregate function

nr.attempts
<-aggregate(RawSeq$GENOTYPE_ID,list(sample=RawSeq$SAMPLE_ID,assay=RawSeq$ASSAY_ID),length)
This was simply to figure out how many times the same piece of information
had been obtained. I ran out of patience. It took beyond forever and tapply
did not perform much better. The reshape package did not help - it implied
one was out of luck if the data was not numeric. All of my data is character
or factor.

Instead I used RODBC

sqlSave(channel,RawSeq)
to push the table into a Microsoft Access database
Then a sql query, courtesy of the Microsoft Access Query Wizard a la design
mode.

SELECT RawSeq.SAMPLE_ID, RawSeq.ASSAY_ID, Min(RawSeq.GENOTYPE_ID) AS
MinOfGENOTYPE_ID, Max(RawSeq.GENOTYPE_ID) AS MaxOfGENOTYPE_ID, Count(
RawSeq.rownames) AS CountOfrownames
FROM RawSeq
WHERE (((RawSeq.GENOTYPE_ID)<>""))
GROUP BY RawSeq.SAMPLE_ID, RawSeq.ASSAY_ID
ORDER BY Count(RawSeq.rownames) DESC;

This way I could easily use the minimum and maximum values to see if they
were discordant.
Microsoft Access handled it with aplomb. I plan to use RODBC to bring the
result of the SQL query back into R.

This is the first time I have seen Microsoft Access outpace R.
Is my observation correct or am I missing something. I would much rather
perform all data manipulation and analyses in R.

-- 
Farrel Buchinsky

	[[alternative HTML version deleted]]

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Charles C. Berry                        (858) 534-2098
                                          Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	         UC San Diego
http://biostat.ucsd.edu/~cberry/         La Jolla, San Diego 92093-0717

On Sun, 31 Dec 2006, Farrel Buchinsky wrote:

 I have hundreds of humans who have undergone SNP genotyping at hundreds of
 loci. Some have even undergone the procedure twice or thrice (kind of an
 internal control).

 So obviously I need to find those replications, and confirm that the
 results
 are the same. If there is discordance then I need to address it.
Why not use  duplicated() ?
More specifically:

 	unique( IDs[ duplicated( IDs ) & ! duplicated ( cbind (IDs, SNPs ) ) ] )

gives a list of those IDs for which the SNPs in all replicates of an ID 
are not the same.
For a data.frame with 200 rows of which about 50 are duplicates and 201 
columns finding the (non) duplicates takes little time on my year old AMD 64 
running Windows XP:

 my.dat <- data.frame(ID=rep(1:100, sample(1:3,100,repl=T)))
 snp.dat <- lapply(1:200,function(x) 0:1 )
 snp.frame <- as.data.frame(do.call(cbind,snp.dat))
 my.dat <- cbind( my.dat,snp.frame[sample(nrow(my.dat))%%2+1,])
 system.time( table(duplicated(my.dat)) )
[1] 0.03 0.00 0.03   NA   NA

Finding the non-duplicated rows for which there is at least one replication:

 system.time( which( (!duplicated(my.dat)) & (my.dat$ID %in%
 names(which(table(my.dat$ID)>1)) ) ))
[1] 0.05 0.00 0.05   NA   NA

 I tried to use the aggregate function

 nr.attempts
 <-aggregate(RawSeq$GENOTYPE_ID,list(sample=RawSeq$SAMPLE_ID,assay=RawSeq$ASSAY_ID),length)
 This was simply to figure out how many times the same piece of information
 had been obtained. I ran out of patience. It took beyond forever and
 tapply
 did not perform much better. The reshape package did not help - it implied
 one was out of luck if the data was not numeric. All of my data is
 character
 or factor.

 Instead I used RODBC

 sqlSave(channel,RawSeq)
 to push the table into a Microsoft Access database
 Then a sql query, courtesy of the Microsoft Access Query Wizard a la
 design
 mode.

 SELECT RawSeq.SAMPLE_ID, RawSeq.ASSAY_ID, Min(RawSeq.GENOTYPE_ID) AS
 MinOfGENOTYPE_ID, Max(RawSeq.GENOTYPE_ID) AS MaxOfGENOTYPE_ID, Count(
 RawSeq.rownames) AS CountOfrownames
 FROM RawSeq
 WHERE (((RawSeq.GENOTYPE_ID)<>""))
 GROUP BY RawSeq.SAMPLE_ID, RawSeq.ASSAY_ID
 ORDER BY Count(RawSeq.rownames) DESC;

 This way I could easily use the minimum and maximum values to see if they
 were discordant.
 Microsoft Access handled it with aplomb. I plan to use RODBC to bring the
 result of the SQL query back into R.

 This is the first time I have seen Microsoft Access outpace R.
 Is my observation correct or am I missing something. I would much rather
 perform all data manipulation and analyses in R.

 --
 Farrel Buchinsky

  [[alternative HTML version deleted]]

 ______________________________________________
 R-help at stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.

Charles C. Berry                        (858) 534-2098
                                        Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	         UC San Diego
http://biostat.ucsd.edu/~cberry/         La Jolla, San Diego 92093-0717

Charles C. Berry                        (858) 534-2098
                                          Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	         UC San Diego
http://biostat.ucsd.edu/~cberry/         La Jolla, San Diego 92093-0717