
Character SNP data to binary MAF data

5 messages · Hadassa Brunschwig, Barry Rowlingson, Thomas Lumley +1 more

#
Hi

An example is as follows. Consider the character 3x6 matrix:

a A a T A t
G g t T T t
A a C C c c

For each row I would like to identify the most frequent letter and
assign a 1 to it and a 0 to every other (less frequent) letter. That
is, in row 1 the most frequent letter is A (I do not differentiate
between upper- and lower-case letters), in row 2 it is T and in row 3
it is C. After the binary conversion the resulting matrix would look
like this:

1 1 1 0 1 0
0 0 1 1 1 1
0 0 1 1 1 1

Any suggestions on how to do that? (I am sure I am not the first
one to try this.)

Thanks
Hadassa


On Thu, Jan 29, 2009 at 1:50 AM, Jorge Ivan Velez
<jorgeivanvelez at gmail.com> wrote:

  
    
#
2009/1/29 Hadassa Brunschwig <hadassa.brunschwig at mail.huji.ac.il>:
What if there's a tie for most frequent? Do you want 1s for all the
most frequent characters? Or choose one randomly? Or zeroes?

 Examples: what do the following become:

 A A C C T G
 A A C C T T
 A A A A A A

Or are such cases not possible?
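
A quick way to see how many letters share the top count in each of those rows is to tabulate them. This is only a sketch: the object names `ties`, `counts` and `n_top` are made up for illustration.

```r
# Hypothetical matrix holding the three example rows above
ties <- matrix(c("A", "A", "C", "C", "T", "G",
                 "A", "A", "C", "C", "T", "T",
                 "A", "A", "A", "A", "A", "A"),
               nrow = 3, byrow = TRUE)

# One table of letter counts per row, kept in a list because the rows
# contain different numbers of distinct letters
counts <- lapply(seq_len(nrow(ties)), function(i) table(ties[i, ]))

# Number of letters that share the highest count in each row
n_top <- sapply(counts, function(tc) sum(tc == max(tc)))
n_top   # 2 3 1
```

A value above 1 marks a row where the "most frequent letter" is ambiguous and some tie rule is needed.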

 Some hints for you to work on this yourself:

   help('table') - the table function works out counts of elements of vectors
   help('tolower') - for changing upper to lower case
   help('apply') - for working on rows of matrices (and data frames)

 then check out any basic R tutorial on subscripting and replacement,
and you may need to work out how to loop over things with 'for'. You
should be able to make a working solution in a dozen or so lines of R.
Don't be surprised if some R guru on here does it in 2 or 3 lines of
dense, obfuscated stuff!
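
For what it is worth, the hints above could be put together roughly as follows. This is a sketch, not the definitive answer: the names `snps` and `result` are made up, and it assumes a tie is resolved by taking the first letter in table() order, which may not be what is wanted.

```r
# The 3x6 example matrix from the original question
snps <- matrix(c("a", "A", "a", "T", "A", "t",
                 "G", "g", "t", "T", "T", "t",
                 "A", "a", "C", "C", "c", "c"),
               nrow = 3, byrow = TRUE)

result <- matrix(0L, nrow = nrow(snps), ncol = ncol(snps))
for (i in seq_len(nrow(snps))) {
  row_i  <- tolower(snps[i, ])                # ignore case
  counts <- table(row_i)                      # count each letter
  top    <- names(counts)[which.max(counts)]  # first letter with the top count
  result[i, row_i == top] <- 1L               # replacement via logical subscript
}
result
```

On the example matrix this reproduces the 0/1 matrix given in the question.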

Barry
#
The first step is to convert your data to all uppercase with toupper().

Then it depends on how tidy the data are: are there missing data, are some SNPs monomorphic in your sample, etc.
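
A rough way to check both points before coding anything is sketched below. The small matrix is invented for illustration (it is not the poster's data) and is assumed to be already in upper case, with NA marking a missing genotype.

```r
# Made-up example: row 2 has a missing value and only one allele
the_data <- matrix(c("A", "A", "A", "T", "A", "T",
                     "G", "G", NA,  "G", "G", "G"),
                   nrow = 2, byrow = TRUE)

# Rows with at least one missing genotype
has_missing <- apply(is.na(the_data), 1, any)

# Number of distinct alleles observed in each row, ignoring NA
n_alleles <- apply(the_data, 1, function(r) length(unique(r[!is.na(r)])))

has_missing      # FALSE  TRUE
n_alleles == 1   # FALSE  TRUE, i.e. row 2 is monomorphic in this sample
```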

If there are no missing data you can use

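N <- ncol(the_data)
halfN <- N/2

# Code one row: 1 where the entry is the allele carried by more than
# half of the samples, 0 elsewhere (all 0 if no allele exceeds half)
maf_one_row <- function(arow) {
    rval <- numeric(N)
    if (sum(i <- arow == "A") > halfN) {
         rval[i] <- 1
    } else if (sum(i <- arow == "C") > halfN) {
         rval[i] <- 1
    } else if (sum(i <- arow == "T") > halfN) {
         rval[i] <- 1
    } else if (sum(i <- arow == "G") > halfN) {
         rval[i] <- 1
    }
    rval
}

# apply() returns one column per input row, so transpose the result
t(apply(the_data, 1, maf_one_row))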

You could also use table() to find the two alleles, but you have to make sure that the code still works when there is only one allele.
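
A minimal sketch of that table() idea, written so that a monomorphic row does not break it. The helper name `two_alleles` is hypothetical, missing data are assumed to be absent, and how ties between alleles are ordered is left to sort().

```r
# Hypothetical helper: report the major and minor allele of one row
two_alleles <- function(arow) {
  counts  <- sort(table(toupper(arow)), decreasing = TRUE)
  alleles <- names(counts)
  list(major = alleles[1],
       # With a single allele there is no minor allele to report
       minor = if (length(alleles) > 1) alleles[2] else NA_character_)
}

two_alleles(c("A", "a", "C", "C", "c", "c"))   # major "C", minor "A"
two_alleles(c("G", "g", "G"))                  # major "G", minor NA
```

The 0/1 coding for a row then follows from comparing toupper(arow) with the `major` element.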

      -thomas
On Thu, 29 Jan 2009, Hadassa Brunschwig wrote:

            
Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle
#
Hadassa,
You may want to check out the snpMatrix package in Bioconductor

http://bioconductor.org/packages/2.3/bioc/html/snpMatrix.html
http://bioconductor.org/packages/2.4/bioc/html/snpMatrix.html

It contains classes that manage this type of information and should
minimize your coding effort.


Patrick


Quoting Thomas Lumley <tlumley at u.washington.edu>:
#
2009/1/29 Patrick Aboyoun <paboyoun at fhcrc.org>:
It's not that much effort - this code turns all ties into 1s:

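snp2maf=function(m){
  m=toupper(m)                    # ignore case
  return(t(apply(m,1,makeBin)))   # apply() gives one column per row, so transpose
}

makeBin = function(chars){
  tc = table(chars)               # count each letter in the row
  maxV = names(tc[tc==max(tc)])   # every letter with the top count (ties included)
  matches = match(chars,maxV)
  r=as.integer(!is.na(matches))   # 1 where the letter is in maxV, else 0
  return(r)
}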

Then, for this test matrix:

      [,1] [,2] [,3] [,4] [,5]
 [1,] "t"  "g"  "g"  "g"  "t"
 [2,] "a"  "G"  "a"  "C"  "c"
 [3,] "A"  "T"  "c"  "c"  "C"
 [4,] "g"  "T"  "c"  "A"  "C"
 [5,] "G"  "C"  "G"  "g"  "G"
 [6,] "G"  "t"  "T"  "a"  "C"
 [7,] "A"  "G"  "T"  "g"  "T"
 [8,] "T"  "a"  "C"  "a"  "T"
 [9,] "t"  "g"  "g"  "c"  "T"
[10,] "A"  "t"  "t"  "c"  "A"

the result is:

      [,1] [,2] [,3] [,4] [,5]
 [1,]    0    1    1    1    0
 [2,]    1    0    1    1    1
 [3,]    0    0    1    1    1
 [4,]    0    0    1    0    1
 [5,]    1    0    1    1    1
 [6,]    0    1    1    0    0
 [7,]    0    1    1    1    1
 [8,]    1    1    0    1    1
 [9,]    1    1    1    0    1
[10,]    1    1    1    0    1
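
If 1s for every tied letter are not what is wanted, the tie rule can be made an option. The following is only a sketch: `makeBinTies` and its `ties` argument are hypothetical names, and "first" here means the first letter in table() order.

```r
# Hypothetical variant of makeBin() with a selectable tie rule
makeBinTies <- function(chars, ties = c("all", "first", "none")) {
  ties  <- match.arg(ties)
  chars <- toupper(chars)
  tc    <- table(chars)
  maxV  <- names(tc)[as.vector(tc) == max(tc)]  # letters sharing the top count
  if (length(maxV) > 1) {
    if (ties == "first") maxV <- maxV[1]        # keep a single letter
    if (ties == "none")  maxV <- character(0)   # tied rows become all zeros
  }
  as.integer(chars %in% maxV)
}

x <- c("a", "G", "a", "C", "c")   # row 2 of the matrix above: A and C tie
makeBinTies(x, "all")     # 1 0 1 1 1
makeBinTies(x, "first")   # 1 0 1 0 0
makeBinTies(x, "none")    # 0 0 0 0 0
```

It can be used in place of makeBin inside snp2maf, e.g. apply(m, 1, makeBinTies, ties = "first").
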
Barry