Please consider the following "toy" data matrix example, called "x"
for simplicity. There are 20 different individuals ("ID"), with
information about the alleles (A, T, G, C) at six different loci
("Locus1" - "Locus6") for each of these 20 individuals. At any
single locus (e.g., "Locus1" or "Locus2", ... or "Locus6"), at most
two alleles from the set A, T, C, G occur, so each individual carries
one or the other of them. For example, at Locus1 individuals have
either the A or the T allele only; at Locus2 the individuals can have
either C or G only; at Locus3 the individuals can have either T or G
only.
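ID  Locus1  Locus2  Locus3  Locus4  Locus5  Locus6
1   A       G       T       A       A       C
2   A       G       G       A       C       C
3   A       C       G       G       C       C
4   A       C       G       G       C       C
5   A       G       G       G       A       C
6   T       G       G       G       C       C
7   T       C       G       G       C       C
8   T       C       G       G       A       C
9   T       G       G       G       C       C
10  T       C       G       G       C       C
11  A       G       G       G       A       C
12  A       C       G       G       C       C
13  A       G       G       G       C       C
14  A       G       G       G       A       C
15  A       C       G       G       C       C
16  T       C       G       G       C       C
17  T       G       G       G       A       C
18  T       G       G       G       C       C
19  T       G       G       G       C       C
20  T       C       G       G       A       C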
I want to delete any columns from the dataset where the rarer of the
two alleles has a frequency of ten percent or less. In other words,
I would like to delete Locus3, Locus4, and Locus6 in this data
matrix, because the frequency of the rare allele is not greater than
ten percent (and conversely, the frequency of the common allele is
not less than ninety percent). Please note that the frequency of the
rare allele in Locus6 is equal to zero (conversely, the frequency of
the common allele is equal to one hundred percent).
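To make the rule concrete, here is a minimal sketch of the filter in plain Python. The language is an assumption, since I have not said which one I am using, and the names `ROWS`, `LOCI`, `minor_allele_frequency`, and `loci_to_keep` are made up for illustration. It counts the alleles in each column, takes the frequency of the rarer one (zero if only one allele is observed), and keeps a locus only if that frequency is strictly greater than ten percent.

```python
from collections import Counter
from fractions import Fraction

# Toy data from above: one string of six alleles per individual (IDs 1 to 20).
ROWS = [
    "AGTAAC", "AGGACC", "ACGGCC", "ACGGCC", "AGGGAC",
    "TGGGCC", "TCGGCC", "TCGGAC", "TGGGCC", "TCGGCC",
    "AGGGAC", "ACGGCC", "AGGGCC", "AGGGAC", "ACGGCC",
    "TCGGCC", "TGGGAC", "TGGGCC", "TGGGCC", "TCGGAC",
]
LOCI = ["Locus1", "Locus2", "Locus3", "Locus4", "Locus5", "Locus6"]


def minor_allele_frequency(alleles):
    """Return the frequency of the rarer allele as an exact fraction.

    A locus with a single observed allele has a rare-allele frequency of 0.
    """
    counts = Counter(alleles)
    if len(counts) > 2:
        raise ValueError("expected at most two alleles per locus")
    if len(counts) < 2:
        return Fraction(0)
    return Fraction(min(counts.values()), len(alleles))


def loci_to_keep(rows, loci, threshold=Fraction(1, 10)):
    """Names of loci whose rare-allele frequency is strictly above threshold."""
    columns = zip(*rows)  # transpose: one tuple of alleles per locus
    return [
        name
        for name, column in zip(loci, columns)
        if minor_allele_frequency(column) > threshold
    ]


if __name__ == "__main__":
    print(loci_to_keep(ROWS, LOCI))
```

Exact fractions are used so that a locus sitting exactly at ten percent (Locus4, with 2 of 20) is dropped without depending on floating-point rounding. The same loop works unchanged for 1096 columns.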
Would one of you know of a simple way to write this sort of code? (In
my real dataset, there are 1096 loci, so this cannot be done easily "by