PEGAS: assignment to haplotype when missing information

(Cc. to r-sig-genetics)

I'm going to modify the algorithm in haplotype.DNAbin() as follows:

1. Find the sequences that are exactly identical, so that, e.g., the 3 sequences:

A-
AR
AA

would be treated as different at this step.

2. Replace the leading and trailing "-" with N (thus keeping the alignment gaps only in the 'middle' of sequences).

3. Compute the Hamming distances among haplotypes using 5 states (A, G, C, T, and "-") and ambiguities so that, e.g., d(A,R)=0, d(G,R)=0, d(A,G)=1, and so on.

4. If all these distances are > 0, then exit.

5. Examine each haplotype and its distances to the others:

5a. If exactly one distance is equal to zero, then pool the two haplotypes into a single haplotype and give a warning.

5b. If two or more distances are equal to zero, then keep them separate and give a message (possibly attached to the returned object).
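
To make steps 2 to 5 concrete, here is a minimal sketch in Python (not the pegas code, and not necessarily how haplotype.DNAbin() will end up doing it). It assumes upper-case sequences of equal length stored as plain strings, the standard IUPAC ambiguity codes, and that N stands for any of the four bases but does not match an internal gap; the function names are made up for the illustration.

```python
# Illustrative sketch only; this is not the pegas implementation.
# Each symbol maps to the set of states it is compatible with.
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"}, "-": {"-"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"}, "W": {"A", "T"},
    "K": {"G", "T"}, "M": {"A", "C"},
    "B": {"C", "G", "T"}, "D": {"A", "G", "T"},
    "H": {"A", "C", "T"}, "V": {"A", "C", "G"},
    "N": {"A", "C", "G", "T"},  # assumption: N does not match a gap
}


def mask_end_gaps(seq):
    """Step 2: turn leading and trailing '-' into 'N', keep internal gaps."""
    core = seq.strip("-")
    if not core:  # the sequence is made of gaps only
        return "N" * len(seq)
    lead = len(seq) - len(seq.lstrip("-"))
    trail = len(seq) - len(seq.rstrip("-"))
    return "N" * lead + core + "N" * trail


def hamming(x, y):
    """Step 3: count the sites whose sets of compatible states do not overlap."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned (same length)")
    return sum(1 for a, b in zip(x, y) if not (IUPAC[a] & IUPAC[b]))


def zero_distance_partners(haps):
    """Steps 4-5: for each haplotype, list the other haplotypes at distance 0.

    An empty list means the haplotype is distinct from all others, one
    partner corresponds to case 5a, two or more partners to case 5b.
    """
    masked = [mask_end_gaps(h) for h in haps]
    n = len(masked)
    return {
        i: [j for j in range(n) if j != i and hamming(masked[i], masked[j]) == 0]
        for i in range(n)
    }
```

With the three sequences of step 1, zero_distance_partners(["A-", "AR", "AA"]) reports two partners for each sequence, so under these assumptions all three fall under 5b.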

There could be options to control this algorithm:
- exit after step 1.
- ignore step 2.

At step 5, it seems to make sense to start with the "shortest" sequences and pool them with the "longer" ones, i.e., "A-" would be pooled with "AA".
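
One possible way to order the haplotypes for this is sketched below in Python. It rests on an assumption that is mine for the illustration: "shortest" is taken to mean "fewest fully resolved bases (A, C, G, T)", so that gaps and ambiguity codes do not count towards the length. The function names are hypothetical.

```python
# Illustrative sketch only; "shortest" is interpreted here as
# "fewest unambiguous bases", which is an assumption.
RESOLVED = set("ACGT")


def resolved_length(seq):
    """Number of sites carrying an unambiguous base (A, C, G or T)."""
    return sum(1 for s in seq.upper() if s in RESOLVED)


def pooling_order(haps):
    """Indices of the haplotypes, least resolved first.

    Ties keep their original order because sorted() is stable.
    """
    return sorted(range(len(haps)), key=lambda i: resolved_length(haps[i]))
```

For instance, pooling_order(["AA", "A-", "AR"]) puts "A-" and "AR" before "AA", so the less informative sequences would be examined first and pooled into the more complete one.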

Comments and suggestions are welcome.

Best,

Emmanuel

----- On 26 Feb 20, at 16:35, Emmanuel Paradis emmanuel.paradis at ird.fr wrote: