
ape, read.dna to phase fasta file not working properly, tajima

2 messages · Ella Bowles

 fasta8c18.fa
<https://drive.google.com/file/d/0B6qb8IlaQGFZX0Q0YzNUVGNHV1E/view?usp=drive_web>
Hello,

I'm trying to complete the very simple task of reading in an unphased fasta
file and phasing it using ape, and then calculating Tajima's D using pegas,
but my data doesn't seem to be reading in correctly. Input and output is as
follows:
library("ape")
library("adegenet")
library("ade4")
library("pegas")
Warning message:
In data(DNAbin8c18) : data set ‘DNAbin8c18’ not found
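
A note on the warning above: data() only loads datasets that ship with installed packages, so calling it on an object created in the session produces exactly this message; it does not by itself mean the FASTA file failed to load. The read command is not shown in the post, so the sketch below is only an assumption of what that step could look like, using ape's read.dna() with format = "fasta" on a small temporary file (the file contents and the object name `dna` are made up for illustration).

```r
library(ape)

# Hypothetical toy FASTA file standing in for fasta8c18.fa
fa <- tempfile(fileext = ".fa")
writeLines(c(">seq1", "ACGTACGT",
             ">seq2", "ACGTACGA",
             ">seq3", "ACGAACGT"), fa)

# Sequences of equal length are returned as a DNAbin matrix
dna <- read.dna(fa, format = "fasta")

class(dna)  # "DNAbin"
dim(dna)    # rows = sequences, columns = sites
```

Printing the object, as was done in the post, is then enough to confirm the number of sequences and their length.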

##clearly the data is not read in properly, so looked at what had been
loaded
817452 DNA sequences in binary format stored in a matrix.

All sequences of same length: 96

Labels:
CLocus_12706_Sample_1_Locus_34105_Allele_0 [BayOfIslands_s08...
CLocus_12706_Sample_2_Locus_31118_Allele_0 [BayOfIslands_s08...
CLocus_12706_Sample_3_Locus_30313_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_5_Locus_33345_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_7_Locus_37388_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_8_Locus_29451_Allele_0 [BayOfIslands_s09...
...

More than 10 million nucleotides: not printing base composition

##although likely won't work, trying taj d test to see what happens
Error: cannot allocate vector of size 2489.3 Gb

I'm sending the datafile along as a link as well.

Any thoughts would be much appreciated.

Ella
1 day later
Hello,

I wanted to send a follow-up note to say that the developer helped me with
my problem. His reply was
The problem is that your data are too big (too many sequences) and
tajima.test() needs to compute the matrix of all pairwise distances. You
could check this by trying:

dist.dna(DNAbin8c18, "N")
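
The size in the error message can be reproduced with a little arithmetic. The sketch below assumes each pairwise distance is stored as an 8-byte double and that R reports "Gb" in units of 2^30 bytes; the sequence count is the one printed for DNAbin8c18 in the original post.

```r
n <- 817452                  # number of sequences reported for DNAbin8c18
n_pairs <- n * (n - 1) / 2   # entries in the lower triangle of the distance matrix
bytes <- n_pairs * 8         # assumption: one 8-byte double per distance
gib <- bytes / 2^30          # assumption: "Gb" in the error means 2^30 bytes
round(gib, 1)                # 2489.3
```

This matches the 2489.3 Gb in the error, which supports the explanation that the pairwise distance matrix is what exhausts memory.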

One possibility for you is to randomly sample some observations, and repeat
this many times, e.g.:

tajima.test(DNAbin8c18[sample(nrow(DNAbin8c18), size = 1000), ])

This could be:

N <- 1000 # number of repeats
RES <- matrix(NA_real_, 3, N) # one column per repeat: D, Pval.normal, Pval.beta
for (i in 1:N)
    RES[, i] <- unlist(tajima.test(DNAbin8c18[sample(nrow(DNAbin8c18), size = 10000), ]))

You may adjust N and 'size =' so that the run time stays reasonable. Then
you may look at the distribution of each row of RES across the repeats.
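
As a small runnable illustration of the resampling idea described above, the sketch below uses the woodmouse dataset that ships with ape as a stand-in for the real data. The values of N and size, the seed, and the row names given to RES are choices made for this example, not part of the original reply.

```r
library(ape)
library(pegas)

data(woodmouse)            # 15 mouse sequences shipped with ape, used as a stand-in
set.seed(1)                # make the resampling reproducible

N <- 100                   # number of repeats
size <- 10                 # sequences drawn per repeat
RES <- matrix(NA_real_, 3, N,
              dimnames = list(c("D", "Pval.normal", "Pval.beta"), NULL))

for (i in 1:N) {
  idx <- sample(nrow(woodmouse), size = size)
  RES[, i] <- unlist(tajima.test(woodmouse[idx, ]))
}

summary(RES["D", ])                         # spread of Tajima's D across subsamples
quantile(RES["D", ], c(0.025, 0.5, 0.975))  # central 95% range and median
```

Each column of RES holds one repeat, so summaries are taken along the rows. With the full dataset, the same loop applies after replacing woodmouse with DNAbin8c18 and choosing a larger size.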
On Wed, Aug 30, 2017 at 4:49 PM, Ella Bowles <bowlese at gmail.com> wrote: