Deleting columns where the frequency of values are too disparate

Richard Cotton · 2009-01-19T12:13:27Z

> Please consider the following "toy" data matrix example, called "x"
> for simplicity. There are 20 different individuals ("ID"), with
> information about the alleles (A,T, G, C) at six different loci
> ("Locus1" - "Locus6") for each of these 20 individuals. At any
> single locus (e.g., "Locus1" or "Locus2", ... or "Locus6"), the
> individuals have either one allele (from the set of A,T,C,G) or one
> other allele (from the set of A,T,C, G). For example, at Locus1
> individuals have hav

Most of the problem is just organising the data into a sensible form.

# read in data
data <- readLines(tc <- textConnection("1AGTAAC
2AGGACC
3ACGGCC
4ACGGCC
5AGGGAC
6TGGGCC
7TCGGCC
8TCGGAC
9TGGGCC
10TCGGCC
11AGGGAC
12ACGGCC
13AGGGCC
14AGGGAC
15ACGGCC
16TCGGCC
17TGGGAC
18TGGGCC
19TGGGCC
20TCGGAC")); close(tc)

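# retrieve the useful bit (strip the leading ID digits)
loci <- sub("^[[:digit:]]{1,2}", "", data)

# you may also want this (the ID itself; grep() would only return match positions)
ID <- as.integer(sub("^([[:digit:]]{1,2}).*$", "\\1", data))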

# find out how many of each base occurs at each locus
freqs <- list()
n <- length(ID)
for(i in seq_len(nchar(loci[1])))
{
   # fixed levels so that bases absent at a locus are counted as 0
   assign(paste("locus", i, sep=""),
          factor(substring(loci, i, i), levels=c("A","C","G","T")))
   freqs[[i]] <- summary(get(paste("locus", i, sep="")))
}
freqs

# flag loci to remove: those with 90% or more cases of same base
loci.to.remove <- sapply(freqs, function(x) any(x >= 0.9*n))
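The code above only flags the loci; it does not drop them. A minimal sketch of the last step follows, assuming the data are wanted as a character matrix with one column per locus, and that "90% or more" means the cutoff itself is included. The names x, max.share, threshold and x.kept are made up for this illustration.

```r
# Toy data copied from the message above: ID digits fused to six bases
data <- c("1AGTAAC", "2AGGACC", "3ACGGCC", "4ACGGCC", "5AGGGAC",
          "6TGGGCC", "7TCGGCC", "8TCGGAC", "9TGGGCC", "10TCGGCC",
          "11AGGGAC", "12ACGGCC", "13AGGGCC", "14AGGGAC", "15ACGGCC",
          "16TCGGCC", "17TGGGAC", "18TGGGCC", "19TGGGCC", "20TCGGAC")

# Split each record into its numeric ID and its string of bases
ID   <- as.integer(sub("^([[:digit:]]+).*$", "\\1", data))
loci <- sub("^[[:digit:]]+", "", data)

# One row per individual, one column per locus
x <- do.call(rbind, strsplit(loci, ""))
dimnames(x) <- list(as.character(ID), paste0("Locus", seq_len(ncol(x))))

# Share of the most common base at each locus
max.share <- apply(x, 2, function(col) max(table(col)) / length(col))

# Assumed cutoff: a locus is dropped when one base makes up 90% or more
threshold <- 0.9
x.kept <- x[, max.share < threshold, drop = FALSE]

colnames(x.kept)
```

The drop = FALSE keeps x.kept a matrix even if only one locus survives. Comparing shares (count / n) against the threshold, rather than counts against 0.9 * n, is just a choice made for this sketch.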

Regards,
Richie.

Mathematical Sciences Unit
HSL

