Recoding variables based on reference values in data frame
4 messages · kathleen askland, Rui Barradas, arun
Hello,
I'm not sure I understood, but try the following.
Kgeno <- read.table(text = "
SNP_ID SNP1 SNP2 SNP3 SNP4
Maj_Allele C G C A
Min_Allele T A T G
ID1 CC GG CT AA
ID2 CC GG CC AA
ID3 CC GG nc AA
ID4 _ _ _ _
ID5 CC GG CC AA
ID6 CC GG CC AA
ID7 CC GG CT AA
ID8 _ _ _ _
ID9 CT GG CC AG
ID10 CC GG CC AA
ID11 CC GG CT AA
ID12 _ _ _ _
ID13 CC GG CC AA
", header = TRUE, stringsAsFactors = FALSE)
str(Kgeno)  # check that the genotype columns were read as character
fun <- function(x){
  x[x %in% c("nc", "_")] <- NA
  MM <- paste0(x[1], x[1]) # Major Major
  Mm <- paste0(x[1], x[2]) # Major minor
  mm <- paste0(x[2], x[2]) # minor minor
  x[x == MM] <- 0
  x[x == Mm] <- 1
  x[x == mm] <- 2
  x
}
Kgeno[, -1] <- sapply(Kgeno[, -1], fun)
Kgeno
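A possible variation on the function above, offered as a sketch rather than as part of the original answer: it drops the two reference rows, returns integers instead of character strings, and also counts a heterozygote written minor-first (for example "TC") as 1. That last point is an assumption about how the data may be coded. The data frame `geno` and the helper name `count_minor` are made up for illustration.

```r
# Small made-up data set in the same layout as the example above:
# row 1 = major allele, row 2 = minor allele, remaining rows = genotypes.
geno <- data.frame(
  SNP_ID = c("Maj_Allele", "Min_Allele", "ID1", "ID2", "ID3", "ID4"),
  SNP1   = c("C", "T", "CC", "TC", "_",  "TT"),
  SNP2   = c("G", "A", "GG", "GA", "nc", "GG"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count the minor alleles in each two-letter genotype.
count_minor <- function(x) {
  major <- x[1]
  minor <- x[2]
  calls <- x[-(1:2)]                       # genotype calls only
  out <- rep(NA_integer_, length(calls))   # "nc", "_" and anything unexpected stay NA
  out[calls == paste0(major, major)] <- 0L
  out[calls %in% c(paste0(major, minor), paste0(minor, major))] <- 1L
  out[calls == paste0(minor, minor)] <- 2L
  out
}

coded <- sapply(geno[, -1], count_minor)   # integer matrix, one column per SNP
rownames(coded) <- geno$SNP_ID[-(1:2)]
coded
```

Starting from a vector of NA and filling in only the recognised genotypes means that any unexpected string ends up as NA rather than being carried through unchanged.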
Also, the best way to post data is by using ?dput.
dput(head(Kgeno[, 1:5], 30)) # post the output of this
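To illustrate why dput() output is convenient to post, here is a small sketch: the text it prints is R code that rebuilds the object exactly, so readers can paste it straight into their session. The data frame `small` is made up for the example.

```r
# A tiny made-up data frame in the same layout as the genotype data.
small <- data.frame(
  SNP_ID = c("Maj_Allele", "Min_Allele", "ID1"),
  SNP1   = c("C", "T", "CC"),
  stringsAsFactors = FALSE
)

# dput() prints an R expression; capture it as text, as it would appear in an email.
txt <- capture.output(dput(small))

# Anyone can turn that text back into the same object.
rebuilt <- eval(parse(text = txt))
identical(small, rebuilt)
```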
Hope this helps,
Rui Barradas
On 02-07-2013 21:46, kathleen askland wrote:
I'm new to R (previously used SAS primarily) and I have a genetics data
frame consisting of genotypes for each of 300+ subjects (ID1, ID2, ID3,
...) at 3000+ genetic locations (SNP1, SNP2, SNP3...). A small subset of
the data is shown below:
SNP_ID      SNP1  SNP2  SNP3  SNP4
Maj_Allele  C     G     C     A
Min_Allele  T     A     T     G
ID1         CC    GG    CT    AA
ID2         CC    GG    CC    AA
ID3         CC    GG    nc    AA
ID4         _     _     _     _
ID5         CC    GG    CC    AA
ID6         CC    GG    CC    AA
ID7         CC    GG    CT    AA
ID8         _     _     _     _
ID9         CT    GG    CC    AG
ID10        CC    GG    CC    AA
ID11        CC    GG    CT    AA
ID12        _     _     _     _
ID13        CC    GG    CC    AA
The name of the data file is Kgeno.
What I would like to do is recode all of the genotype values to standard
integer notation, based on their values relative to the reference rows
(Maj_Allele and Min_Allele). Standard notation sums the total of minor
alleles in the genotype, so values can be 0, 1 or 2.
Here are the changes I want to make:
1. If the genotype = "nc" or "_" then set equal to NA.
2. If genotype value = a character string comprised of two consecutive
major allele values -- c(Maj_Allele, Maj_Allele) -- then set equal to 0.
3. If genotype value = c(Maj_Allele, Min_Allele) then set equal to 1.
4. If genotype value = c(Min_Allele, Min_Allele) then set equal to 2.
I've tried the following ifelse processing but get error (Warning: Executed
script did not end with R session at the top-level prompt. Top-level state
will be restored) and can't seem to fix the code properly. I've counted the
parentheses. Also, not sure if it would execute properly if I could fix it.
# change 'nc' and '_' to NA, else leave as is:
Kgeno[,2] <- ifelse(Kgeno[,2] == "nc", "NA", Kgeno[,2])
Kgeno[,2] <- ifelse(Kgeno[,2] == "_", "NA", Kgeno[,2])
# convert genotype strings in the first data column to numeric values
# (two major alleles=0, 1 minor and 1 major=1, 2 minor alleles=2), else
# leave as is (to preserve NA values).
Kgeno[,2] <-
ifelse(Kgeno[,2] == noquote(paste(as.character(Kgeno[1,2]), as.character(
Kgeno[1,2]), sep=""), 0,
ifelse(Kgeno[,2] == noquote(paste(as.character(Kgeno[1,2]), as.character(
Kgeno[2,2]), sep=""), 1,
ifelse(Kgeno[,2] == noquote(paste(as.character(Kgeno[2,2]), as.character(
Kgeno[2,2]), sep=""), 2,
Kgeno[,2])))
Finally, if above code were corrected, this would only change the first
column of data, but I would like to change all 3000+ columns in the same
way.
I would greatly appreciate some suggestions on how to proceed.
Thank you,
Kathleen
---
Kathleen Askland, MD
Assistant Professor
Department of Psychiatry & Human Behavior
The Warren Alpert School of Medicine
Brown University/Butler Hospital
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Hello,
If you have read in the data as factors (stringsAsFactors = TRUE, the
default before R 4.0.0), change the function to the following.
fun <- function(x){
  x <- as.character(x) # factor -> character, so that 0/1/2 can be assigned
  x[x %in% c("nc", "_")] <- NA
  MM <- paste0(x[1], x[1]) # Major Major
  Mm <- paste0(x[1], x[2]) # Major minor
  mm <- paste0(x[2], x[2]) # minor minor
  x[x == MM] <- 0
  x[x == Mm] <- 1
  x[x == mm] <- 2
  x
}
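One thing worth checking when the columns are factors, shown here as a small sketch on made-up values: assigning a value that is not already one of the factor's levels produces NA (with a warning), whereas the same assignment on a character vector works as intended.

```r
# Made-up column: major allele, minor allele, then two genotypes.
f <- factor(c("C", "T", "CC", "CT"))

# Assigning 0 into a factor: "0" is not a level, so the element becomes NA.
g <- f
suppressWarnings(g[g == "CC"] <- 0)   # warning would be: invalid factor level, NA generated
is.na(g[3])

# Converting to character first avoids the problem.
ch <- as.character(f)
ch[ch == "CC"] <- 0
ch[3]
```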
Rui Barradas
Hi,
Maybe this helps:
Kgeno <- read.table(text="
SNP_ID SNP1 SNP2 SNP3 SNP4
Maj_Allele C G C A
Min_Allele T A T G
ID1 CC GG CT AA
ID2 CC GG CC AA
ID3 CC GG nc AA
ID4 _ _ _ _
ID5 CC GG CC AA
ID6 CC GG CC AA
ID7 CC GG CT AA
ID8 _ _ _ _
ID9 CT GG CC AG
ID10 CC GG CC AA
ID11 CC GG CT AA
ID12 _ _ _ _
ID13 CC GG CC AA
",sep="",header=TRUE,stringsAsFactors=FALSE)
library(stringr)
library(car)
fun1 <- function(x){
  MajMin <- paste0(x[1],x[2])
  MajMaj <- str_dup(x[1],2)
  MinMin <- str_dup(x[2],2)
  recode(x,"'nc'=NA;'_'=NA;MajMaj=0;MajMin=1;MinMin=2")}
sapply(Kgeno[,-1],fun1)
#or
mat1 <- sapply(Kgeno[1:2,-1],function(x) {c(str_dup(x,2),paste(x,collapse=""))})[c(1,3,2),]
sapply(seq_len(ncol(Kgeno[,-1])),function(i) {x <- Kgeno[-c(1:2),-1][,i];as.numeric(factor(x,levels=mat1[,i]))-1})
#Speed comparison
KgenoNew <- rbind(Kgeno[c(1:2),-1],sapply(Kgeno[-c(1:2),-1],rep,1e4))
system.time(res1 <- sapply(KgenoNew,fun1))
#   user  system elapsed
#  0.672   0.000   0.674
system.time({
mat1 <- sapply(Kgeno[1:2,-1],function(x) {c(str_dup(x,2),paste(x,collapse=""))})[c(1,3,2),]
res2 <- sapply(seq_len(ncol(KgenoNew)),function(i){ x <- KgenoNew[-c(1:2),][,i];as.numeric(factor(x,levels=mat1[,i]))-1})
})
#   user  system elapsed
#  0.212   0.000   0.214
res1New <- res1[-c(1:2),]
res1New1 <- as.numeric(res1New)
dim(res1New1) <- dim(res1New)
identical(res1New1,res2)
#[1] TRUE
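The second approach above relies on factor levels, which may be easier to see on a single column. The following is a minimal base-R sketch with made-up values (`major`, `minor` and `calls` are illustrative names): the levels are put in the order major-major, major-minor, minor-minor, so the integer codes 1, 2, 3 minus one give 0, 1, 2, and anything that is not a level ("nc", "_") becomes NA.

```r
major <- "C"
minor <- "T"
calls <- c("CC", "CT", "nc", "_", "TT", "CC")

# Levels in the order that corresponds to 0, 1 and 2 minor alleles.
lv <- c(paste0(major, major), paste0(major, minor), paste0(minor, minor))

# Values not among the levels turn into NA; codes 1..3 are shifted to 0..2.
coded <- as.numeric(factor(calls, levels = lv)) - 1
coded
```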
A.K.