An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/r-help/attachments/20061231/3bfdc727/attachment.pl
Does SQL group by have a heavy duty equivalent in R
7 messages · Farrel Buchinsky, Hadley Wickham, Charles C. Berry
nr.attempts <-aggregate(RawSeq$GENOTYPE_ID,list(sample=RawSeq$SAMPLE_ID,assay=RawSeq$ASSAY_ID),length) This was simply to figure out how many times the same piece of information had been obtained. I ran out of patience. It took beyond forever and tapply did not perform much better. The reshape package did not help - it implied one was out of luck if the data was not numeric. All of my data is character or factor.
The reshape package will work if all your data is numeric, or all of it is character - it doesn't work with a mix. I will try and make this more clear in the documentation. However, depending on the size and structure of your data it may not be any faster than tapply or aggregate. Hadley
I converted the whole data frame to character by using
as.matrix
And then using a posting that explained how to get the naming conventions
back (which had been lost when converting to matrix)
Anything that I did not list with the id's it insisted in including them
with the measured variables. In other words it would not let me drop.
despite
melted<-melt(BigDF, id=c("SAMPLE_ID","ASSAY_ID"),
measured=c("GENOTYPE_ID","DESCRIPTION"))
unique(melted$variable)
[1] CUSTOMER PROJECT PLATE EXPERIMENT CHIP
WELL_POSITION GENOTYPE_ID DESCRIPTION ENTRY_OPERATOR
[10] INTERACT PLATEc
Levels: CUSTOMER PROJECT PLATE EXPERIMENT CHIP WELL_POSITION GENOTYPE_ID
DESCRIPTION ENTRY_OPERATOR INTERACT PLATEc
I should have only got GENOTYPE_ID and DESCRIPTION
"hadley wickham" <h.wickham at gmail.com> wrote in message
news:f8e6ff050612310758p11f96c0dl256ac5b15d11dc2c at mail.gmail.com...
nr.attempts <-aggregate(RawSeq$GENOTYPE_ID,list(sample=RawSeq$SAMPLE_ID,assay=RawSeq$ASSAY_ID),length) This was simply to figure out how many times the same piece of information had been obtained. I ran out of patience. It took beyond forever and tapply did not perform much better. The reshape package did not help - it implied one was out of luck if the data was not numeric. All of my data is character or factor.
The reshape package will work if all your data is numeric, or all of it is character - it doesn't work with a mix. I will try and make this more clear in the documentation. However, depending on the size and structure of your data it may not be any faster than tapply or aggregate. Hadley
______________________________________________ R-help at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
I converted the whole data frame to character by using as.matrix
You shouldn't need to do that.
And then using a posting that explained how to get the naming conventions
back (which had been lost when converting to matrix)
Anything that I did not list with the id's it insisted in including them
with the measured variables. In other words it would not let me drop.
despite
melted<-melt(BigDF, id=c("SAMPLE_ID","ASSAY_ID"),
measured=c("GENOTYPE_ID","DESCRIPTION"))
That should be measure=c(...) Hadley
An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/r-help/attachments/20061231/70c12810/attachment.pl
On Sun, 31 Dec 2006, Farrel Buchinsky wrote:
I have hundreds of humans who have undergone SNP genotyping at hundreds of loci. Some have even undergone the procedure twice or thrice (kind of an internal control). So obviously I need to find those replications, and confirm that the results are the same. If there is discordance then I need to address it.
Why not use duplicated() ? For a data.frame with 200 rows of which about 50 are duplicates and 201 columns finding the (non) duplicates takes little time on my year old AMD 64 running Windows XP:
my.dat <- data.frame(ID=rep(1:100, sample(1:3,100,repl=T))) snp.dat <- lapply(1:200,function(x) 0:1 ) snp.frame <- as.data.frame(do.call(cbind,snp.dat)) my.dat <- cbind( my.dat,snp.frame[sample(nrow(my.dat))%%2+1,]) system.time( table(duplicated(my.dat)) )
[1] 0.03 0.00 0.03 NA NA
Finding the non-duplicated rows for which there is at least one replication:
system.time( which( (!duplicated(my.dat)) & (my.dat$ID %in% names(which(table(my.dat$ID)>1)) ) ))
[1] 0.05 0.00 0.05 NA NA
I tried to use the aggregate function nr.attempts <-aggregate(RawSeq$GENOTYPE_ID,list(sample=RawSeq$SAMPLE_ID,assay=RawSeq$ASSAY_ID),length) This was simply to figure out how many times the same piece of information had been obtained. I ran out of patience. It took beyond forever and tapply did not perform much better. The reshape package did not help - it implied one was out of luck if the data was not numeric. All of my data is character or factor. Instead I used RODBC sqlSave(channel,RawSeq) to push the table into a Microsoft Access database Then a sql query, courtesy of the Microsoft Access Query Wizard a la design mode. SELECT RawSeq.SAMPLE_ID, RawSeq.ASSAY_ID, Min(RawSeq.GENOTYPE_ID) AS MinOfGENOTYPE_ID, Max(RawSeq.GENOTYPE_ID) AS MaxOfGENOTYPE_ID, Count( RawSeq.rownames) AS CountOfrownames FROM RawSeq WHERE (((RawSeq.GENOTYPE_ID)<>"")) GROUP BY RawSeq.SAMPLE_ID, RawSeq.ASSAY_ID ORDER BY Count(RawSeq.rownames) DESC; This way I could easily use the minimum and maximum values to see if they were discordant. Microsoft Access handled it with aplomb. I plan to use RODBC to bring the result of the SQL query back into R. This is the first time I have seen Microsoft Access outpace R. Is my observation correct or am I missing something. I would much rather perform all data manipulation and analyses in R. -- Farrel Buchinsky [[alternative HTML version deleted]]
______________________________________________ R-help at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Charles C. Berry (858) 534-2098
Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu UC San Diego
http://biostat.ucsd.edu/~cberry/ La Jolla, San Diego 92093-0717
On Sun, 31 Dec 2006, Charles C. Berry wrote:
On Sun, 31 Dec 2006, Farrel Buchinsky wrote:
I have hundreds of humans who have undergone SNP genotyping at hundreds of loci. Some have even undergone the procedure twice or thrice (kind of an internal control). So obviously I need to find those replications, and confirm that the results are the same. If there is discordance then I need to address it.
Why not use duplicated() ?
More specifically: unique( IDs[ duplicated( IDs ) & ! duplicated ( cbind (IDs, SNPs ) ) ] ) gives a list of those IDs for which the SNPs in all replicates of an ID are not the same.
For a data.frame with 200 rows of which about 50 are duplicates and 201 columns finding the (non) duplicates takes little time on my year old AMD 64 running Windows XP:
my.dat <- data.frame(ID=rep(1:100, sample(1:3,100,repl=T))) snp.dat <- lapply(1:200,function(x) 0:1 ) snp.frame <- as.data.frame(do.call(cbind,snp.dat)) my.dat <- cbind( my.dat,snp.frame[sample(nrow(my.dat))%%2+1,]) system.time( table(duplicated(my.dat)) )
[1] 0.03 0.00 0.03 NA NA
Finding the non-duplicated rows for which there is at least one replication:
system.time( which( (!duplicated(my.dat)) & (my.dat$ID %in% names(which(table(my.dat$ID)>1)) ) ))
[1] 0.05 0.00 0.05 NA NA
I tried to use the aggregate function nr.attempts <-aggregate(RawSeq$GENOTYPE_ID,list(sample=RawSeq$SAMPLE_ID,assay=RawSeq$ASSAY_ID),length) This was simply to figure out how many times the same piece of information had been obtained. I ran out of patience. It took beyond forever and tapply did not perform much better. The reshape package did not help - it implied one was out of luck if the data was not numeric. All of my data is character or factor. Instead I used RODBC sqlSave(channel,RawSeq) to push the table into a Microsoft Access database Then a sql query, courtesy of the Microsoft Access Query Wizard a la design mode. SELECT RawSeq.SAMPLE_ID, RawSeq.ASSAY_ID, Min(RawSeq.GENOTYPE_ID) AS MinOfGENOTYPE_ID, Max(RawSeq.GENOTYPE_ID) AS MaxOfGENOTYPE_ID, Count( RawSeq.rownames) AS CountOfrownames FROM RawSeq WHERE (((RawSeq.GENOTYPE_ID)<>"")) GROUP BY RawSeq.SAMPLE_ID, RawSeq.ASSAY_ID ORDER BY Count(RawSeq.rownames) DESC; This way I could easily use the minimum and maximum values to see if they were discordant. Microsoft Access handled it with aplomb. I plan to use RODBC to bring the result of the SQL query back into R. This is the first time I have seen Microsoft Access outpace R. Is my observation correct or am I missing something. I would much rather perform all data manipulation and analyses in R. -- Farrel Buchinsky [[alternative HTML version deleted]]
______________________________________________ R-help at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Charles C. Berry (858) 534-2098
Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu UC San Diego
http://biostat.ucsd.edu/~cberry/ La Jolla, San Diego 92093-0717
Charles C. Berry (858) 534-2098
Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu UC San Diego
http://biostat.ucsd.edu/~cberry/ La Jolla, San Diego 92093-0717